This repository contains the pipeline to obtain and process hydrophone data from Ocean Networks Canada (ONC).
To run this code, you will need a token for the ONC API. It is freely available once you have signed up for an Oceans 3.0 account.
Hint: You can get your token from the Web Services API tab in your profile. Please refer to the Oceans 3.0 API Home for the complete documentation.
The dataset generation pipeline is divided into steps. Each step can be run separately, but some steps are prerequisites for later ones.
Attention: It is HIGHLY recommended to have a large amount of storage available for downloading the ONC WAV files.
A Dockerfile is available in this repository to simplify the environment setup. As the only requirements are Python dependencies, a virtual environment can be used instead.
To install the dependencies, run the following commands:
pip install -r requirements.txt
pip install -r requirements-dev.txt
To generate a complete dataset, run the pipeline steps that match your needs. The config.py file contains all the settings needed to adapt the dataset generation pipeline.
If you need data from deployments other than the ones used in this project, please refer to the ONC Data Search website to find the desired device codes. You can change these codes in the config.py file.
Steps 0 to 7 are the basic steps to download and process the AIS data and the WAV files and to synchronize them. Steps 8 and 9 are optional and only needed if the CTD information is desired. To generate a metadata file with the annotations in .csv format, run step 10.
Examples:
To run a pipeline with the CTD information:
In the config.py file, change the STEPS variable to STEPS=[0,1,2,3,4,5,6,7,8,9,10] and set USE_CTD=True. Then run:
python src/main.py
To run a pipeline without the CTD information:
In the config.py file, change the STEPS variable to STEPS=[0,1,2,3,4,5,6,7,10] (removing 8 and 9) and set USE_CTD=False. Then run:
python src/main.py
PS: After step 0 you will have a .csv file containing the information about the different deployments of the desired devices (hydrophones). Always check that the coordinates and dates are the desired ones. You can directly delete rows that are not needed for your specific application. The remainder of the pipeline will use all the deployments specified in the .csv files.
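Instead of deleting rows by hand, the deployments file can also be filtered with a short script. The following is only a sketch: the column names (deviceCode, begin, end) and the sample rows are invented for illustration and may not match the file written by step 0.

```python
import csv
import io
from datetime import datetime

# Stand-in for the deployments file. Column names and values are made up.
SAMPLE = """deviceCode,begin,end,latitude,longitude,depth,location
DEV_A,2017-01-01T00:00:00,2017-06-01T00:00:00,0.0,0.0,100,Site 1
DEV_A,2019-01-01T00:00:00,2019-06-01T00:00:00,0.0,0.0,100,Site 1
DEV_B,2018-03-01T00:00:00,2018-09-01T00:00:00,1.0,1.0,200,Site 2
"""


def keep_deployments(rows, start, end):
    """Keep deployments whose recording period overlaps [start, end]."""
    kept = []
    for row in rows:
        begin = datetime.fromisoformat(row["begin"])
        finish = datetime.fromisoformat(row["end"])
        if begin <= end and finish >= start:
            kept.append(row)
    return kept


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
selected = keep_deployments(rows, datetime(2018, 1, 1), datetime(2019, 12, 31))
```

Writing the selected rows back with csv.DictWriter would then give the later steps a reduced deployments file.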
Steps 11, 12, and 13 only manage metadata. They balance the metadata, split it into smaller segments, and split the data into train, validation, and test subsets. Note that this will not affect your downloaded data (AIS, WAV, and CTD); it only produces a new .csv metadata file with the new configuration.
A brief description of the pipeline can be found below, with the process split into steps 0 to 13:
- Query the ONC server for the deployments of the chosen hydrophones;
- Read the following information: recording begin, recording end, latitude, longitude, depth, and location;
- Save the information into a .csv file.
- Search for AIS data for the chosen date;
- Download the .txt files from ONC.
This function parses the AIS messages downloaded from ONC into JSON files, filtering by message type and discarding messages without the needed values.
- Find the downloaded .txt AIS files;
- Keep only the relevant messages. They are: Position report, Static and voyage related data, Standard Class B equipment position report, Extended Class B equipment position report, and Static data report;
- Keep only a few fields from those messages: position (x and y), speed over ground (SOG), course over ground (COG), true heading, and type and cargo codes;
- Save the corresponding information into a .json file.
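The message filtering described above can be sketched as follows. The type numbers are the standard AIS message IDs for the five categories listed. The field names, the "id" key, and the discard rule are assumptions made for illustration, and the function filter_message is hypothetical rather than taken from this repository.

```python
# AIS message type numbers for the five categories listed above.
RELEVANT_TYPES = {
    1, 2, 3,  # Position report (Class A)
    5,        # Static and voyage related data
    18,       # Standard Class B equipment position report
    19,       # Extended Class B equipment position report
    24,       # Static data report
}

# Assumed field names; a real AIS decoder may name them differently.
KEPT_FIELDS = ("mmsi", "x", "y", "sog", "cog", "true_heading", "type_and_cargo")


def filter_message(message):
    """Return a reduced dict for a relevant message, or None to discard it."""
    if message.get("id") not in RELEVANT_TYPES:
        return None
    reduced = {k: message[k] for k in KEPT_FIELDS if k in message}
    # Assumed rule: a message is useful only if it carries a position
    # or a type and cargo code.
    has_position = "x" in reduced and "y" in reduced
    if not has_position and "type_and_cargo" not in reduced:
        return None
    return reduced
```

Each surviving dict could then be written out with json.dump.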
- Read the .json files into dataframes;
- Propagate the 'type_and_cargo' values to all messages with the same MMSI;
- Drop duplicates and messages without positional coordinates;
- Calculate the distance from the hydrophone to the vessel;
- Keep only the data that fits the chosen scenario;
- Save the corresponding information into a .feather file.
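Two of the operations above lend themselves to a short illustration: propagating the type and cargo code across an MMSI and computing the hydrophone-to-vessel distance. This sketch uses plain dicts and the haversine great-circle formula. Both are assumptions; the repository works on dataframes and may compute distance differently.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def propagate_ship_type(messages):
    """Copy each MMSI's known type_and_cargo onto all of its messages."""
    known = {
        m["mmsi"]: m["type_and_cargo"]
        for m in messages
        if m.get("type_and_cargo") is not None
    }
    # Vessels whose type was never reported get None.
    return [{**m, "type_and_cargo": known.get(m["mmsi"])} for m in messages]
```

Filtering on the result of haversine_m against a maximum range is one way to express "fits the chosen scenario", although the actual scenario criteria live in config.py.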
- Read the cleaned AIS files;
- Remove the vessel entries that have just one message;
- Dump the AIS data to a monolithic .feather file;
- Generate new data with linearly interpolated values to obtain more granularity;
- Combine the raw and interpolated dataframes;
- Dump the interpolated AIS data to a monolithic .feather file.
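The interpolation idea can be sketched in a few lines. This is a simplified illustration with tuples instead of dataframes. The function name, the (t, lat, lon) layout, and the fixed time step are assumptions, not the repository's implementation.

```python
def interpolate_track(track, step_s):
    """Insert linearly interpolated points between consecutive samples.

    `track` is a list of (t, lat, lon) tuples sorted by time `t` in seconds.
    New points are placed every `step_s` seconds after each raw sample,
    strictly before the next raw sample.
    """
    out = []
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(track, track[1:]):
        t = t0 + step_s
        while t < t1:
            w = (t - t0) / (t1 - t0)  # fraction of the way from t0 to t1
            out.append((t, lat0 + w * (lat1 - lat0), lon0 + w * (lon1 - lon0)))
            t += step_s
    return out


raw = [(0, 0.0, 0.0), (40, 4.0, 8.0)]
interpolated = interpolate_track(raw, 10)
# Combine raw and interpolated samples into one time-ordered track.
combined = sorted(raw + interpolated)
```

A vessel with a single message yields no pairs to interpolate between, which is consistent with removing such vessels beforehand.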
- Find all of the cleaned AIS files for each deployment;
- Find the time intervals where only one vessel is within range;
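Finding the intervals where only one vessel is within range amounts to scanning a timeline for runs with exactly one vessel present. A minimal sketch follows, assuming the AIS data has already been reduced to a time-sorted list of (timestamp, set of MMSIs in range) pairs. That input shape and the function name are hypothetical.

```python
def single_vessel_intervals(in_range):
    """Return (start, end, mmsi) runs where exactly one vessel is in range.

    `in_range` is a time-sorted list of (t, set_of_mmsi) pairs.
    """
    intervals = []
    current = None  # [start, end, mmsi] of the run being built
    for t, vessels in in_range:
        mmsi = next(iter(vessels)) if len(vessels) == 1 else None
        if current is not None and mmsi == current[2]:
            current[1] = t  # same lone vessel: extend the run
            continue
        if current is not None:
            intervals.append(tuple(current))  # run ended: store it
        current = [t, t, mmsi] if mmsi is not None else None
    if current is not None:
        intervals.append(tuple(current))
    return intervals


timeline = [(0, {1}), (1, {1}), (2, {1, 2}), (3, {2}), (4, {2}), (5, set()), (6, {1})]
runs = single_vessel_intervals(timeline)
```

Each returned run gives a candidate time window for the WAV download, labelled by the MMSI of the only vessel present.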
- Search for WAV data for the chosen scenario;
- Download the .wav files from ONC.
WARNING: The disk space required depends on the size of your deployment and the date chosen. Make sure you have enough space.
- Read the .csv file to extract the time periods;
- Search the raw WAV files folder for the files covering each period;
- Read the WAV files and split them into normalized 1-minute pieces of audio;
- Group the pieces of audio by the time periods from the AIS files;
- Save them into the correct folder.
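The splitting and normalization of the audio can be illustrated on a plain list of samples. The source does not say which normalization is used, so peak normalization is an assumption here, as is dropping a trailing piece shorter than one minute. In practice the samples would come from a WAV reader as an array rather than a list.

```python
def split_and_normalize(samples, sample_rate, seconds=60):
    """Split samples into fixed-length chunks and peak-normalize each one.

    A trailing chunk shorter than `seconds` is dropped (assumed behaviour).
    """
    chunk_len = sample_rate * seconds
    chunks = []
    for start in range(0, len(samples) - chunk_len + 1, chunk_len):
        chunk = samples[start:start + chunk_len]
        peak = max(abs(s) for s in chunk)
        # Leave silent chunks untouched to avoid dividing by zero.
        chunks.append([s / peak for s in chunk] if peak else list(chunk))
    return chunks


# Tiny example: 2 samples per second, 1-second chunks.
pieces = split_and_normalize([0.5, -0.25, 0.0, 0.0, 0.1], sample_rate=2, seconds=1)
```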
- Search for CTD data for the chosen date;
- Download it from ONC.
- Select only the salinity, conductivity, temperature, pressure, and sound speed information;
- Save the corresponding information into a .feather file.
- Get the following information for each time period: label, duration, file path, sample rate, class code, date, and MMSI;
- Also get the average of the CTD data over the time period: salinity, conductivity, temperature, pressure, and sound speed;
- Normalize the CTD information;
- Save all the data into a .csv file.
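The CTD averaging and normalization can be sketched as below. The reading layout (dicts with a "t" key) and both function names are hypothetical, and min-max scaling is only one possible choice, since the normalization method is not specified above.

```python
from statistics import mean


def mean_ctd(readings, start, end, fields):
    """Average each CTD field over readings with start <= t <= end."""
    window = [r for r in readings if start <= r["t"] <= end]
    if not window:
        return None  # no CTD data for this time period
    return {f: mean(r[f] for r in window) for f in fields}


def min_max_normalize(values):
    """Scale values to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


readings = [
    {"t": 0, "temperature": 8.0, "salinity": 30.0},
    {"t": 10, "temperature": 10.0, "salinity": 32.0},
    {"t": 100, "temperature": 20.0, "salinity": 40.0},
]
averages = mean_ctd(readings, 0, 60, ("temperature", "salinity"))
```

The normalization would be applied per column across all metadata rows, after the per-period averages have been collected.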
- Count the occurrences of each class;
- Apply an undersampling strategy to trim the larger classes to the size of the smallest one;
- Save all the data into a .csv file.
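A count-based random undersampling, as described above, might look like the sketch below. The "label" column name, the fixed seed, and the function itself are assumptions; the repository may select which rows to keep differently.

```python
import random
from collections import defaultdict


def undersample(rows, label_key="label", seed=0):
    """Randomly trim every class to the size of the smallest class.

    `rows` must be a non-empty list of dicts.
    """
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    target = min(len(group) for group in by_class.values())
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))
    return balanced


rows = (
    [{"label": "cargo", "id": i} for i in range(5)]
    + [{"label": "tug", "id": i} for i in range(2)]
    + [{"label": "tanker", "id": i} for i in range(3)]
)
balanced = undersample(rows)
```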
- Read the original metadata generated in step 10;
- Create a new column named sub_init to hold the time at which each new entry starts;
- Split each .csv row according to the chosen duration;
- Create a new row for each new entry;
- Save all the data into a .csv file.
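The row splitting can be illustrated with a small helper. Only the sub_init column name comes from the description above. The "duration" key, the choice to overwrite it with the segment length, and dropping a short trailing remainder are assumptions.

```python
def split_row(row, segment_s):
    """Expand one metadata row into fixed-length segments.

    Each new row copies the original fields and adds `sub_init`, the offset
    in seconds at which the segment starts. A trailing remainder shorter
    than `segment_s` is dropped (assumed behaviour).
    """
    n_segments = int(row["duration"] // segment_s)
    return [
        {**row, "sub_init": i * segment_s, "duration": segment_s}
        for i in range(n_segments)
    ]


row = {"path": "a.wav", "duration": 60, "label": "cargo"}
segments = split_row(row, 25)
```

Here a 60-second entry split into 25-second segments yields two rows, starting at 0 and 25 seconds, both pointing at the same audio file.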
- Read all the metadata;
- Shuffle the data randomly;
- Save all the data into three .csv files: Train, Validation, and Test.
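The shuffle and three-way split can be sketched as follows. The split fractions, the seed, and the function name are illustrative assumptions; the actual proportions are whatever config.py specifies.

```python
import random


def train_val_test_split(rows, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle rows and split them into train, validation and test lists."""
    shuffled = list(rows)  # copy so the caller's list is not reordered
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


rows = list(range(100))
train, val, test = train_val_test_split(rows)
```

Every row lands in exactly one subset, and using a fixed seed makes the split repeatable across runs.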
The results of this work were published in IEEE Access, under the following reference:
@article{domingos2022investigation,
author={Domingos, Lucas C. F. and Santos, Paulo E. and Skelton, Phillip S. M. and Brinkworth, Russell S. A. and Sammut, Karl},
journal={IEEE Access},
title={An Investigation of Preprocessing Filters and Deep Learning Methods for Vessel Type Classification With Underwater Acoustic Data},
year={2022},
volume={10},
number={},
pages={117582-117596},
doi={10.1109/ACCESS.2022.3220265}}
A complete literature review covering the background of this work is available in the following reference:
@article{domingos2022survey,
author={Domingos, Lucas C. F. and Santos, Paulo E. and Skelton, Phillip S. M. and Brinkworth, Russell S. A. and Sammut, Karl},
title={A Survey of Underwater Acoustic Data Classification Methods Using Deep Learning for Shoreline Surveillance},
volume={22},
ISSN={1424-8220},
url={http://dx.doi.org/10.3390/s22062181},
DOI={10.3390/s22062181},
number={6},
publisher={MDPI AG},
journal={Sensors},
year={2022},
month={Mar},
pages={2181}
}
This code, as well as the pipeline formulation and the code used as the basis, was developed in collaboration with Phillip Skelton.
Thanks to Paulo Santos for the guidance and participation in this project.