utils/prepare_lang.sh --phone-symbol-table crashes if the symbol file has no #0 #4344

Open
kkm000 opened this issue Nov 20, 2020 · 3 comments
kkm000 commented Nov 20, 2020

The error message from the FST binary does not clearly point to the cause. This is an easy fix; I'll do it.

The tool also trusts that the index of #0 is 1 larger than that of the last phone symbol, which should be checked. We generally try to validate everything user-supplied as much as possible.

I'm thinking of adding a utility script to validate FST symbol tables in general, to make sure that a file does not contain duplicate strings or duplicate indexes, that index 0 is <eps>, and so on. There are a couple of places where a more thorough check is done, a couple of others where it is half-done, and this one does not do much checking at all.
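A minimal sketch of what such a check could look like, under stated assumptions: the name `validate_symtab` is hypothetical and is not an existing Kaldi utility, and the table is assumed to be in the usual two-column `symbol index` text format. It covers only the three conditions mentioned above (duplicate strings, duplicate indexes, index 0 being `<eps>`), plus basic line-format checks.

```shell
# Hypothetical helper, not part of Kaldi.
# Usage: validate_symtab <symbol-table-file>
# Prints one line per problem to stderr; returns non-zero if any were found.
validate_symtab() {
  awk '
    # Every line must be "symbol index".
    NF != 2 {
      printf "line %d: expected 2 fields, got %d\n", NR, NF
      bad = 1; next
    }
    # The index must be a non-negative integer.
    $2 !~ /^[0-9]+$/ {
      printf "line %d: index \"%s\" is not a non-negative integer\n", NR, $2
      bad = 1; next
    }
    {
      i = $2 + 0   # compare indexes numerically, so "01" and "1" collide
      if ($1 in sym) { printf "line %d: duplicate symbol %s\n", NR, $1; bad = 1 }
      if (i in idx)  { printf "line %d: duplicate index %d\n", NR, i; bad = 1 }
      sym[$1] = i
      idx[i] = $1
    }
    END {
      if (!(0 in idx) || idx[0] != "<eps>") {
        print "index 0 is not <eps>"
        bad = 1
      }
      exit bad + 0
    }
  ' "$1" >&2
}
```

A caller could then stop early with something like `validate_symtab "$phone_symbol_table" || exit 1`, or downgrade the failure to a warning.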

kkm000 added the bug, in progress, and Priority: lower labels Nov 20, 2020
kkm000 self-assigned this Nov 20, 2020
jtrmal commented Nov 20, 2020 via email

danpovey commented Nov 21, 2020 via email

kkm000 commented Nov 22, 2020

Yeah, a warning would be fine by me, too.

@danpovey

> Where do we require that #0 is 1 larger than the last phone symbol? I don't believe that is a requirement.

In fact we do not; it is not a requirement. It is just how the tool happens to work: it greps for '#0' and uses its index as the base for the additional disambiguation symbols, adding 1 for the next one if it does not exist. So, as written, '#0' had better follow the phone symbols. This applies only when the script is invoked with the --phone-symbol-table switch.

# Create phone symbol table.
if [[ ! -z $phone_symbol_table ]]; then
  start_symbol=`grep \#0 $phone_symbol_table | awk '{print $2}'`
  echo "<eps>" | cat - $dir/phones/{silence,nonsilence}.txt | awk -v f=$phone_symbol_table '
  BEGIN { while ((getline < f) > 0) { phones[$1] = $2; }} { print $1" "phones[$1]; }' | sort -k2 -g |\
    cat - <(cat $dir/phones/disambig.txt | awk -v x=$start_symbol '{n=x+NR-1; print $1, n;}') > $dir/phones.txt
else
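In the excerpt above, if the table has no '#0' line, `start_symbol` is empty, so awk's `x` evaluates to 0 and the disambiguation symbols are numbered from 0, which is where the later, unclear FST error comes from. A guard along the following lines could report the problem directly. This is only a sketch: `disambig_start_index` is a hypothetical name, not something in utils/prepare_lang.sh, and printing a warning rather than failing when '#0' does not follow the phones is an assumption based on the discussion above.

```shell
# Hypothetical helper, not part of utils/prepare_lang.sh.
# Usage: disambig_start_index <phone-symbol-table>
# Prints the index of '#0' on stdout, or fails with a clear message.
disambig_start_index() {
  local table=$1 start max
  # Match the symbol field exactly; a plain grep for '#0' would also
  # match any other line that merely contains that string.
  start=$(awk '$1 == "#0" { print $2; exit }' "$table")
  case $start in
    '')
      echo "$0: symbol '#0' not found in $table" >&2
      return 1 ;;
    *[!0-9]*)
      echo "$0: index of '#0' in $table is not an integer: $start" >&2
      return 1 ;;
  esac
  # Largest index among symbols that are not disambiguation symbols.
  max=$(awk '$1 !~ /^#[0-9]+$/ && $2 + 0 > m { m = $2 + 0 }
             END { print m + 0 }' "$table")
  if [ "$start" -le "$max" ]; then
    echo "$0: warning: '#0' has index $start, which is not above" \
         "the largest phone index $max in $table" >&2
  fi
  echo "$start"
}
```

With this in place, the assignment in the script could become `start_symbol=$(disambig_start_index "$phone_symbol_table") || exit 1`, leaving the rest of the pipeline unchanged.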
