spectral feature script: parallelize, test at scale. #51

@msonderegger

While you (@MichaelGoodale) and Michael M are still working, I would like you to optimize the spectral features R script referred to here:
https://iscan.readthedocs.io/en/latest/tutorials_iscan.html#tutorial-4-custom-scripts

and test that tabular import/export works at scale -- on one of our large corpora -- in reasonable time.

The barrier to doing this testing before was that the script is slow: it loops through every row of the CSV to calculate spectral features, one row at a time. You should:

  1. modify the script so it can run in parallel, with a user-specified number of cores. Just add a binary flag and nCores as arguments the user has to fill in at the top of the script, and make the default no parallelization (so the demo you wrote for ISCAN RTD works out of the box). doParallel/foreach is one way to do this, I think.

  2. run the script for all sibilants from two large corpora which also have different phonesets -- let's say Buckeye and SOTC. So you'd need to tabular export info about all sibilants, then run using parallel option on roquefort.

2a. Optional: if this takes too long on roquefort, figure out how to do it on the Compute Canada cluster. (I have a working example somewhere of R in parallel on the CC cluster if needed.) The main issue here might be having enough space to store datasets on CC servers. Let me know if it's an issue.

  3. tabular import for the two corpora

  4. for the two corpora, do the same export as in sibilants.py (all word-initial stressed-syllable sibilants etc., one column gives speech rate, another the word label...), but exporting all the new measures (calculated with the R script) rather than the Praat-script-calculated measures we've used before.

While you are doing this, please keep a record of how long steps 2, 3, and 4 (each) take -- to assess feasibility of getting these measures across many corpora.

  • modify script to allow running in parallel
  • tabular export for Buckeye, SOTC
  • run R script for Buckeye
  • run R script for SOTC
  • import results back into ISCAN-accessible databases on roquefort
  • do export of these measures as in sibilants.py
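
To make the first item concrete, here is a minimal sketch of the flag-plus-nCores pattern using foreach/doParallel, with `system.time` wrapped around the run so the elapsed time can be recorded. Everything specific in it is an assumption for illustration: `runParallel`, `compute_features`, `run_rows`, and the toy `tokens` table are hypothetical stand-ins, not names or logic from the actual spectral features script, and the per-row computation is a placeholder rather than a real spectral measure.

```r
library(foreach)
library(doParallel)

# User settings at the top of the script (names are illustrative).
runParallel <- FALSE  # default: no parallelization, so the demo works out of the box
nCores <- 2           # only used when runParallel is TRUE

# Hypothetical stand-in for the per-row spectral feature computation.
compute_features <- function(row) {
  data.frame(id = row$id, duration = row$end - row$begin)
}

# Toy input in place of the tabular export CSV.
tokens <- data.frame(
  id = 1:4,
  begin = c(0.0, 1.0, 2.0, 3.0),
  end = c(0.1, 1.2, 2.3, 3.4)
)

# Apply feature_fn to every row, in parallel or sequentially depending on the flag.
# feature_fn is passed as an argument so foreach can export it to the workers.
run_rows <- function(tokens, feature_fn, runParallel, nCores) {
  if (runParallel) {
    cl <- makeCluster(nCores)
    registerDoParallel(cl)
    on.exit(stopCluster(cl), add = TRUE)  # release the workers when done
    foreach(i = seq_len(nrow(tokens)), .combine = rbind) %dopar% {
      feature_fn(tokens[i, ])
    }
  } else {
    foreach(i = seq_len(nrow(tokens)), .combine = rbind) %do% {
      feature_fn(tokens[i, ])
    }
  }
}

# Time the whole run so the duration of the step can be logged.
timing <- system.time(
  result <- run_rows(tokens, compute_features, runParallel, nCores)
)
elapsed_seconds <- timing[["elapsed"]]
```

With row-level work this small, the cluster start-up cost would dominate, so in practice it may be worth handing each worker a chunk of rows rather than a single row; that is a design choice to check against the real per-row cost.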
