Description
While you (@MichaelGoodale) and Michael M are still working, I would like to optimize the spectral features R script referred to here:
https://iscan.readthedocs.io/en/latest/tutorials_iscan.html#tutorial-4-custom-scripts
and test that tabular import/export works at scale -- on one of our large corpora -- in reasonable time.
The barrier to doing this testing before was that the script is slow: it loops through every row of the CSV to calculate spectral features, one row at a time. You should:
1. Modify the script so it can run in parallel, with a user-specified number of cores. Just add a binary flag and `nCores` as arguments the user has to fill in at the top of the script, and make the default no parallelization (so the demo you wrote for ISCAN RTD works out of the box). `doParallel`/`foreach` is one way to do this, I think.
2. Run the script for all sibilants from two large corpora which also have different phonesets -- let's say Buckeye and SOTC. So you'd need to do a tabular export of info about all sibilants, then run the script with the parallel option on roquefort.
   - 2a. Optional: if this takes too long on roquefort, figure out how to do it on the Compute Canada (CC) cluster. (I have a working example somewhere of R in parallel on the CC cluster if needed.) The main issue here might be having enough space to store datasets on CC servers. Let me know if it's an issue.
3. Do the tabular import for the two corpora.
4. For the two corpora, do the same export as in sibilants.py (all word-initial stressed-syllable sibilants etc., one column gives speech rate, another the word label...), but exporting all the new measures (calculated with the R script) rather than the Praat-script-calculated measures we've used before.
While you are doing this, please keep a record of how long steps 2, 3, and 4 (each) take -- to assess feasibility of getting these measures across many corpora.
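
One possible way to keep that timing record is a small Python helper wrapped around each step. This is only a sketch: the names `timed`, `write_timings`, and `timings` are hypothetical, and the step labels and the commented-out `Rscript` call are placeholders, not the actual commands.

```python
import csv
import time
from contextlib import contextmanager

# Each entry is (step label, corpus name, elapsed seconds).
timings = []


@contextmanager
def timed(step, corpus):
    """Record the wall-clock time of the enclosed block, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings.append((step, corpus, time.perf_counter() - start))


def write_timings(path):
    """Write all recorded timings to a CSV file for later comparison."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["step", "corpus", "seconds"])
        for step, corpus, seconds in timings:
            writer.writerow([step, corpus, f"{seconds:.3f}"])


# Hypothetical usage (the command and file names are placeholders):
#
#   import subprocess
#   with timed("2: run R script", "Buckeye"):
#       subprocess.run(["Rscript", "spectral_features.R"], check=True)
#   write_timings("step_timings.csv")
```

Recording one row per step and corpus would make it easy to compare Buckeye and SOTC side by side when judging whether this scales to many corpora.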
- modify script to allow running in parallel
- tabular export for Buckeye, SOTC
- run R script for Buckeye
- run R script for SOTC
- import results back into ISCAN-accessible databases on roquefort
- do export of these measures as in sibilants.py