utils/prepare_lang.sh --phone-symbol-table crash if the symbol file has no #0 #4344
Comments
Are you sure you want to filter out/check for duplicate strings? I'm not sure
if there would be a problem in general, as most of Kaldi cares about
the indices only...
Not sure -- just asking.
y.
On Fri, Nov 20, 2020 at 4:54 AM kkm000 wrote:
The message from an FST binary doesn't clearly point to the cause. An easy
fix, I'll do.
The tool also trusts the index value of #0 being 1 larger than the last
phone symbol, which better be checked. We generally try to validate
everything user-supplied as much as possible.
I'm thinking of adding a utility script to validate FST symbol tables in
general, to make sure a file does not contain duplicate strings or
duplicate indexes, that 0 is <eps> and so on. There are a couple places
where a more thorough check is done, a couple other where it's half-done,
and this one does not do much checking at all.
|
Yeah, I think duplicate strings should probably be a warning rather than an
error.
We already check for duplicate ids, search for "duplicates" in
validate_lang.pl.
Where do we require that #0 is 1 larger than the last phone symbol?
I don't believe that is a requirement.
(In reply to jtrmal's message of Fri, Nov 20, 2020 at 11:46 PM, shown above.)
|
Yeah, a warning would be fine by me, too.
We in fact do not; it's not a requirement. It's just how the tool happens to work: it greps for '#0' and uses its index as the base for additional disambiguators, adding 1 for the next one if it does not exist. So, as written, '#0' had better follow the symbols. This is only for the case handled in kaldi/egs/wsj/s5/utils/prepare_lang.sh, lines 317 to 323 in 0c6a3dc.
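To make the behavior described in this comment concrete, here is a minimal Python sketch. It is not the code from prepare_lang.sh; the function name `add_disambig_symbols`, the dict representation of the table, and the exact numbering (`#k` gets the index of `#0` plus `k`) are assumptions made for illustration. It fails with an explicit message when '#0' is missing, and it refuses to reuse an index that another symbol already holds, which is the situation that arises when '#0' does not follow the other symbols.

```python
def add_disambig_symbols(symtab, num_disambig):
    """Return a copy of symtab (symbol -> index) with #1..#num_disambig added.

    Hypothetical helper: new disambiguation symbols are numbered relative
    to the index of '#0', as described in the comment above.
    """
    if "#0" not in symtab:
        # Fail early with a clear message instead of letting a later
        # FST binary fail with an unrelated-looking error.
        raise ValueError("symbol table has no '#0' entry")

    base = symtab["#0"]
    result = dict(symtab)
    used = set(result.values())

    for k in range(1, num_disambig + 1):
        name = f"#{k}"
        if name in result:
            continue  # already present in the user-supplied table
        index = base + k
        if index in used:
            # Happens when '#0' is not placed after the other symbols.
            raise ValueError(f"index {index} for '{name}' is already in use")
        result[name] = index
        used.add(index)
    return result
```

With '#0' last in the table, for example `{"<eps>": 0, "a": 1, "#0": 2}`, the new symbols are appended without collisions. With '#0' placed before a phone, the sketch raises instead of silently assigning a duplicate index.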
|
The message from an FST binary doesn't clearly point to the cause. An easy fix, I'll do.
The tool also trusts the index value of #0 being 1 larger than the last phone symbol, which better be checked. We generally try to validate everything user-supplied as much as possible.
I'm thinking of adding a utility script to validate FST symbol tables in general, to make sure a file does not contain duplicate strings or duplicate indexes, that 0 is <eps> and so on. There are a couple places where a more thorough check is done, a couple other where it's half-done, and this one does not do much checking at all.
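As a rough illustration of the checks listed in this report, here is a short Python sketch of a symbol table validator. It is not an existing Kaldi utility: the function name `validate_symbol_table` and the message wording are hypothetical. It assumes the usual text format of one `<symbol> <index>` pair per line, and it treats duplicate indexes as errors and duplicate strings as warnings, following the preference expressed in this thread.

```python
from collections import Counter


def validate_symbol_table(lines):
    """Check text symbol table lines; return (errors, warnings) as lists."""
    errors, warnings = [], []
    entries = []

    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        ok = (len(fields) == 2
              and fields[1].isascii()
              and fields[1].isdigit())
        if not ok:
            errors.append(f"line {lineno}: expected '<symbol> <index>'")
            continue
        entries.append((fields[0], int(fields[1])))

    # Duplicate indexes make the table ambiguous, so report them as errors.
    index_counts = Counter(index for _, index in entries)
    for index, count in sorted(index_counts.items()):
        if count > 1:
            errors.append(f"index {index} appears {count} times")

    # Duplicate strings are only reported as warnings.
    symbol_counts = Counter(symbol for symbol, _ in entries)
    for symbol, count in sorted(symbol_counts.items()):
        if count > 1:
            warnings.append(f"symbol '{symbol}' appears {count} times")

    # Index 0 is expected to be <eps>.
    if ("<eps>", 0) not in entries:
        errors.append("index 0 is not <eps>")

    return errors, warnings
```

A caller could print the warnings and exit with a non-zero status only when the error list is non-empty, so that a table with a repeated string still passes while one with a repeated index does not.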