xeno-canto-py is an API wrapper designed to help users download xeno-canto.org recordings and associated information efficiently. Download requests are processed concurrently using the asyncio, aiohttp and aiofiles libraries to reduce retrieval time. The wrapper also offers delete and metadata generation functions for recording library management.
It was created to aid in data collection and filtering for the training of machine learning models.
xeno-canto-py is available on PyPI and can be installed with the package manager pip:
pip install xeno-canto
The package can then be used straight from the command line:
xeno-canto -dl Bearded Bellbird
Or imported into an existing Python project:
import xenocanto
For users who want more control over the wrapper, navigate to your desired file location in a terminal window and then clone the repository with the following command:
git clone https://github.com/ntivirikin/xeno-canto-py
The only file required for operation is xenocanto.py, so feel free to remove the others or move xenocanto.py to another working directory.
WARNING: Please exercise caution when using test.py, as executing the tests via unittest or another test harness will delete any dataset folder in the working directory once the tests complete.
The xeno-canto-py wrapper supports the retrieval of metadata and audio from the xeno-canto database. It also provides library management functions: deletion of recordings matching input tags, removal of folders with too few audio recordings, and generation of a single JSON metadata file for a given path containing xeno-canto audio recordings. Examples of command usage are given below.
Metadata Download
xeno-canto -m [parameters]
Downloads metadata as a series of JSON files and returns the path to the metadata folder.
Example: Metadata retrieval for Bearded Bellbird recordings of quality A
xeno-canto -m Bearded Bellbird q:A
Audio Recording Download
xeno-canto -dl [parameters]
Retrieves the metadata for the request and uses it to download audio recordings as MP3s from the database.
Example: Download Bearded Bellbird recordings from the country of Brazil
xeno-canto -dl Bearded Bellbird cnt:Brazil
Delete Recordings
xeno-canto -del [parameters]
Deletes recordings matching ANY of the parameters given as input.
Example: Delete ALL quality D recordings and ALL recordings from Brazil
xeno-canto -del q:D cnt:Brazil
Purge Folders
Removes any folders within the dataset/audio/ directory that have fewer recordings than the input value num.
xeno-canto -p [num]
Example: Remove recording folders with fewer than 10 recordings (folders with exactly 10 are kept)
xeno-canto -p 10
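Because purging removes folders, it can help to preview which folders fall below the threshold first. The following is a minimal sketch of such a dry run, not part of the wrapper itself: the function name folders_below_threshold is hypothetical, and it assumes the dataset/audio/ layout with one folder per species and recordings stored as .mp3 files.

```python
from pathlib import Path


def folders_below_threshold(audio_dir, num):
    """Return the names of species folders holding fewer than `num` MP3 files.

    Hypothetical helper for previewing a purge; it only reads the
    directory tree and never deletes anything.
    """
    root = Path(audio_dir)
    if not root.is_dir():
        # Nothing to inspect if the audio directory does not exist.
        return []

    small = []
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        # Count only files with an .mp3 extension (assumed recording format).
        count = sum(
            1 for f in folder.iterdir()
            if f.is_file() and f.suffix.lower() == ".mp3"
        )
        if count < num:  # strictly fewer, matching the "not inclusive" rule
            small.append(folder.name)
    return small
```

Calling folders_below_threshold("dataset/audio", 10) from the working directory would list the folders that a purge with the value 10 is expected to affect, under the assumptions above.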
Generate Metadata
Generates metadata for the xeno-canto database recordings at the input path, defaulting to dataset/audio/ within the working directory if none is given.
xeno-canto -g [path]
Example: Generate metadata for the recordings located in bird_rec/audio/ within the working directory
xeno-canto -g bird_rec/audio/
parameters are given in tag:value form in accordance with the API search guidelines. For help in building search terms, consult the xeno-canto API guide and this article. The only exception is when providing English bird names as an argument to the delete function: these must be preceded by en: and have all spaces replaced with underscores.
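The en: rule for the delete function can be sketched as a small helper. Both function names below are hypothetical and are not part of xeno-canto-py; the sketch only builds the argument list for the -del command described above and does not run it.

```python
def english_name_tag(name):
    """Format an English bird name for the delete command.

    Prefixes the name with "en:" and joins its words with underscores.
    Runs of whitespace are collapsed into a single underscore.
    """
    return "en:" + "_".join(name.split())


def delete_command(*tags):
    """Build the argument list for a delete call (illustrative only)."""
    return ["xeno-canto", "-del", *tags]
```

For example, delete_command(english_name_tag("Bearded Bellbird"), "q:D") builds the arguments for deleting all Bearded Bellbird recordings and all quality D recordings.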
Files are saved in the working directory under the folder dataset/. Metadata and audio recordings are separated into metadata/ and audio/ folders by request information and bird species respectively. For example:
dataset/
  - audio/
    - Indigo Bunting/
      - 14325.mp3
    - Northern Cardinal/
      - 8273.mp3
  - metadata/
    - library.json
    - IndigoBuntingcnt_Canada/
      - page1.json
    - NorthernCardinalq_A/
      - page1.json
Metadata is retrieved as a JSON file and contains information on each of the audio recordings matching the request parameters provided as input. The metadata also contains the download links used to retrieve the audio recordings. The library.json file is generated by running the metadata generation command -g.
If an Error 503 is given when attempting a recording download, try passing a value lower than 4 as the num_chunks value in download(filt, num_chunks). This can be done either by changing the default value in the function definition for download(filt, num_chunks), or by passing a value into download(params) in the body of main() as shown below.
# Running with default 4 locks on semaphore
asyncio.run(download(params))
# Running with 3 locks rather than default
asyncio.run(download(params, 3))
Alternatively, you can experiment with higher values for num_chunks, which may improve performance.
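To illustrate why num_chunks affects both server load and speed, here is a minimal sketch of semaphore-limited concurrency with asyncio. It is not the wrapper's actual implementation: fake_download and download_all are hypothetical names, and the network request is replaced by a short sleep so the example runs offline.

```python
import asyncio


async def fake_download(recording_id, semaphore, state):
    """Stand-in for one recording download; sleeps instead of using the network."""
    async with semaphore:  # waits here while num_chunks downloads are in flight
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # simulated transfer time
        state["active"] -= 1
    return recording_id


async def download_all(recording_ids, num_chunks=4):
    """Run all simulated downloads with at most num_chunks active at once.

    Returns the results in request order and the peak number of
    simultaneously active downloads that was observed.
    """
    semaphore = asyncio.Semaphore(num_chunks)
    state = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(fake_download(r, semaphore, state) for r in recording_ids)
    )
    return results, state["peak"]
```

Running asyncio.run(download_all(range(10), 3)) keeps at most three simulated downloads active at a time. A lower limit sends fewer simultaneous requests, which is the likely reason it helps with Error 503, while a higher limit allows more overlap at the cost of more load on the server.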
All pull requests are welcome! If any issues are found, please do not hesitate to bring them to my attention.
Thank you to the team at xeno-canto.org and all its contributors for putting together such an amazing database.