moeheart
Contributor

Suppose there is an n-gram A-B. If B is not in the vocabulary, it should be treated either as <unk> or as having probability -inf. In the latter case, A-B should also score -inf, but in the current code it actually scores backoff(A).
Usually this happens when the language model used for rescoring is not a superset of the language model used for decoding, so some words in the lattice may be unknown to the rescoring model.
Here is an example of the harm this does. Suppose 1, 2, 3, ... are the states in the lattice, and A, B, C, ... are the words on the edges. Consider this lattice:
1---(A)---2---(B)---3---(C)---5
           \                 /
            --(D)---4---(E)--
where A-D-E is supposed to be the correct result, and B is a low-frequency word that does not appear in the vocabulary. We can expect the weight of 2-(B)-3 to be large, so the one-best algorithm is unlikely to choose this path.
Now, in the rescoring procedure, we first subtract the old language model weights, so the weight of 2-(B)-3 becomes smaller than the others. Normally we do not need to worry about that, because either a large weight is soon added back (B is low frequency) or the edge is removed (B is not in the vocabulary).
However, since A-B triggers the bug, the score that is added back may be just backoff(A), which means the weight of 2-(B)-3 is not so large, and the result may mistakenly become A-B-C.
This is also why carpa behaves differently from the fst-like structure. In the fst-like structure, the edges near B are removed if B is out of vocabulary. This would definitely reduce the error when the language model is small, but it is unexpected when the model is large.

@danpovey
Contributor

I don't think the bug is there; I think it is at line 794, which does
return std::numeric_limits<float>::min();
but should probably do:
return -std::numeric_limits<float>::infinity();
(min() gives the smallest positive value).
-infinity plus a finite value is -infinity.
Can you test this alternative fix?

@moeheart
Contributor Author

I think you are right. I will try this soon.

@jtrmal
Contributor

jtrmal commented Jul 28, 2020

great find!

@moeheart
Contributor Author

I have tested the modification, and it works as expected. Another place, in the function GetArc(), is also modified so that it gives the identical result.

@danpovey
Contributor

danpovey commented Jul 28, 2020 via email

@jtrmal
Contributor

jtrmal commented Jul 28, 2020

@kkm000 please merge if you OK the change, it's your code (or touches your code)

@kkm000
Contributor

kkm000 commented Jul 28, 2020

LGTM

@jtrmal, no, I wrote the compiler, I think carpa was a later addition, so I'm getting no credit for this buggy code. :)


By the way, not related to this change but just in case, a tidbit to keep in mind about the messiness of std::numeric_limits<T> when T is a floating-point type. The functions below return the following values of a floating-point type T:

  • std::numeric_limits<T>::lowest(): the lowest (most negative) finite representable value (huge negative number).
  • std::numeric_limits<T>::denorm_min(): the smallest positive representable value (very small positive number).
  • std::numeric_limits<T>::min(): the smallest positive non-denormal number (also a very small positive number).
  • std::numeric_limits<T>::max(): the largest finite representable value of type T (huge positive number).

In other words,

lowest() <= -max() < -min() <= -denorm_min() < ((T)0) < denorm_min() <= min() < max()
         ^[1]               ^[2]                                     ^[2]

[1] strictly less is an exotic case.
[2] strictly equal is an exotic case.

Also, for the regular 32-, 64- and 80-bit floats, and, I think but am not sure, for all IEEE-754[-like?] floating-point numbers, lowest() == -max() holds. But this is not guaranteed to be true for all float types, which is why lowest() was added in C++11. Also, the values of -denorm_min() and -min() are what you think they should be; no change here, up to and including C++20. The only exception is max(), which, apparently, in exotic cases does not produce the most negative representable number when negated.

For reference, the table

| `T` | `lowest()` | `denorm_min()` | `min()` | `max()` |
| ---: | ---: | ---: | ---: | ---: |
| float | -3.40282e+38 | 1.4013e-45 | 1.17549e-38 | 3.40282e+38 |
| double | -1.79769e+308 | 4.94066e-324 | 2.22507e-308 | 1.79769e+308 |
| long double | -1.18973e+4932 | 3.6452e-4951 | 3.3621e-4932 | 1.18973e+4932 |

and the code to produce it, partly lifted from cppreference.com:

#include <limits>
#include <iostream>

int main()
{
  std::cout << "| `T` | `lowest()` | `denorm_min()` | `min()` | `max()` |\n"
            << "| ---: | ---: | ---: | ---: | ---: |\n"

            << "| float | "
            << std::numeric_limits<float>::lowest()     << " | "
            << std::numeric_limits<float>::denorm_min() << " | "
            << std::numeric_limits<float>::min()        << " | "
            << std::numeric_limits<float>::max()        << " |\n"

            << "| double | "
            << std::numeric_limits<double>::lowest()     << " | "
            << std::numeric_limits<double>::denorm_min() << " | "
            << std::numeric_limits<double>::min()        << " | "
            << std::numeric_limits<double>::max()        << " |\n"

            << "| long double | "
            << std::numeric_limits<long double>::lowest()     << " | "
            << std::numeric_limits<long double>::denorm_min() << " | "
            << std::numeric_limits<long double>::min()        << " | "
            << std::numeric_limits<long double>::max()        << " |\n";
}

And, in the end, min() was the worst choice here: it's a positive number, but not even the smallest positive number!


I think I should put this into the Wiki somewhere.

@kkm000 kkm000 merged commit f88d5a3 into kaldi-asr:master Jul 28, 2020