CHANGES

Version History

0.90 29 Jun 95 first working code, n-gram models only

0.91 02 Aug 95 snapshot for fosler@icsi, minor bug fixes

0.92 13 Aug 95 added BayesMix, VarNgram LMs

0.93 27 Aug 95 included all LM95 code

0.94 13 Oct 95

new directory structure mirroring DECIPHER layout.
man pages added
added support for Decipher N-best list rescoring
added Null LM
added new utility scripts
bug fixes

0.95 08 Sep 96 as of WS96

added Trellis class, disambig program
added support for pause tokens (-pau-) in sentences (these are ignored for sentence prob computation)
added -tolower mapping
added word reversal
made Ngram model reading much faster (optimized floating point parsing)
added template class for ngram count tries (to use either integer or float count value)
added optional noise tag skipping
added SkipNgram model
added Witten-Bell backoff
ported to native Sun and SGI C++ compilers (see doc/c++porting-notes),
suppress log10(0.0) warnings

0.96 05 Jun 97

Honor -gtNmin parameter even when discounting of higher counts is effectively disabled. (Allows building maximum likelihood LMs smoothed only by low-count ngram elimination.)
Ignore pauses and noise in nbest-lattice alignments (also added -noise option).
ngram now supports mixtures of up to 6 ngram models.
added HiddenSNgram LM.
warn about multiple uses of '-' file for input or output
zio now handles incomplete reading of compressed file without error
Fixed interaction between deletion and iterations
Fixed handling of OOVs in cache model
Fixed decipher N-best rescoring: we now duplicate even the roundoff errors incurred by bytelogs. Also added -decipher flag to ngram to allow replication of recognizer LM scores. Also, takes into account that Decipher (incorrectly) applies WTW even to pauses.
Enhanced decipher-rescore script to deal with NBestList2.0 format, with -bytelog and -nodecipherlm options.
Added tools to convert bigram and trigram backoff LMs into Decipher PFSG format (pfsg-from-ngram).
Enable DecipherNgram models order higher than bigram (ngram -decipher-order flag). Default is still bigram.
Fixed bug that caused float command line arguments to be parsed incorrectly on SunOS4 systems (missing declaration in system header).

0.97 30 Aug 97 as of WS97

New programs: segment and segment-nbest (moved here from development code).
Made low-level NgramLM access functions public (findProb, findBOW, insertProb, insertBOW).
Fixed nbest-lattice to use normalized posterior word probabilities in lattice.
NBest, nbest-lattice: added N-best error computation.
WordLattice, nbest-lattice: added lattice error computation.
WordLattice: base all alignments on edit distance costs defined in WordAlign.h.
contextID() now also returns length of context used. Added contextID() implementations for NullLM and BayesMix.
Fixed contextID() for Ngram: don't truncate context if BOW = 1.
Fixed SArray, LHash to avoid assignment operator on remove().
Fixed add-ppls, subtract-ppls to handle -ppl -debug 2 output.
Lots of memory management fixes.
SArrayIter and LHashIter now work even while underlying object is being moved (as when containing data structure is enlarged).
Added HTK Lattice tool interface (htk/ directory).
Made Trellis into a template class.
Allow arbitrary n-gram orders with disambig(1).
Added forward-backward decoding and posterior probability computation to disambig(1).
Added disambig -lmw and -mapw options.
Added HMMofNgrams model (ngram -hmm option).
VocabMap reader now warns about duplicate entries

0.98 18 April 98

Allow ngram to disable Decipher LM backoff hack, for rescoring new exact lattices (ngram -decipher-nobackoff).
N-best list vocabulary is now always expanded dynamically (no more OOVs in N-best lists).
Added wrapper script for nbest-lattice to compute N-best error rate (nbest-error).
Skip ngrams exceeding model order when reading.
Fixed memory bug in generateSentence().
Changed libmisc to work with Tcl version > 7.
Compute word error correctly for empty N-best list.
Added ngram pruning based on model perplexity change (ngram-count -prune and ngram -prune).
Old ngram -prune option renamed -varprune.
New lattice word error minimization (nbest-lattice -lattice-wer).
Fixed ngram -gen bug due to omissions in SunOS4 header files.
merge-batch-counts removes merged source files
Added ngram -prune-lowprobs function to do the work of remove-lowprob-ngrams, but much faster and using less memory.
Added support for new Decipher NBestList2.0 format.
Added word error count and posterior probability fields to NBestHyp structure.
Added optional factor argument to countSentence() (convenient to compute fractional sufficient statistics for alternative training methods).
Don't make special symbols (<s>, </s>, <unk>) member of SubVocab by default.
Ported to gcc 2.8.1.

0.99 31 July 1999

Added hidden-ngram (word-boundary tagger).
Removed line length limit for File object.
Added disambig -continuous flag.
Fixed backward computation in disambig (again).
Generalized compute-best-mix to N > 2 models
Added AdaptiveMix LM class
Added nbest-mix utility (interpolation of N-best posteriors)
Added ngram -unk flag to handle open-class LMs
Added disambig and hidden-ngram -text-map option
Script enhancements:
- New script to convert nbest-lattice word graphs to PFSG (wlat-to-pfsg)
- Added switches to include probabilities in wlat-to-dot and pfsg-to-dot output.
- Conversion to/from AT&T FSM format: fsm-to-pfsg and pfsg-to-fsm
ngram -rescore and associated scripts no longer set a hyp probability to zero if it contains OOVs. Instead, the probability is computed ignoring those words (more useful in practice). A warning is output as always.
Added ngram-count -float-counts option.
Added build support for Linux/i686 platform.
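
The generalization of compute-best-mix to more than two models can be pictured as an EM estimate of interpolation weights. The following is a minimal Python sketch under that assumption, not the script's actual code; the function name best_mix and its input layout (one tuple of per-model word probabilities per word) are hypothetical.

```python
def best_mix(probs, iterations=50):
    """Estimate interpolation weights for N models by EM.

    probs: list of tuples, one per word; each tuple holds the
           probability that every model assigns to that word.
    """
    n_models = len(probs[0])
    weights = [1.0 / n_models] * n_models  # uniform starting point
    for _ in range(iterations):
        totals = [0.0] * n_models
        for word_probs in probs:
            mix = sum(w * p for w, p in zip(weights, word_probs))
            if mix == 0.0:
                continue  # word has zero probability under every model
            for i, (w, p) in enumerate(zip(weights, word_probs)):
                totals[i] += w * p / mix  # posterior of model i for this word
        norm = sum(totals)
        if norm == 0.0:
            break  # nothing to re-estimate from
        weights = [t / norm for t in totals]
    return weights
```

With two models the same loop reduces to the familiar single-lambda update; nothing in it depends on N.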

1.00 8 June 2000

Added ClassNgram class and ngram -classes option.
Capability to convert class ngrams into word ngrams.
New program ngram-class for automatic word class induction.
Fixed interaction of ngram -mix-lm -bayes with non-standard n-grams: can now build an interpolation of the non-standard (hidden-event, class-based, etc.) n-gram with the additional, standard n-grams.
Replaced LM.noiseTag with LM.noiseVocab (list of noise tags to be ignored). Tools now take -noise-vocab option (as well as -noise for backward compatibility).
Made ngram -counts work for non-n-gram models.
Added nbest-lattice -posterior-{amw,lmw,wtw} options to compute word posteriors with different weightings from the one used in hypothesis ranking. Also added -deletion-bias flag for explicit control of del/ins errors (-use-mesh mode only).
NBest rescoring methods now have optional acoustic model weight (defaulting to 1 as before).
New class RefList (list of reference transcripts).
New class NBestSet (set of N-Best lists).
NBest, NBestSet, and nbest-lattice optionally split multiwords into their components on reading (-multiwords option).
New nbest-optimize tool for finding near-optimal score combination weights for word error minimizing N-best rescoring.
New anti-ngram program, for computing posterior-weighted N-gram counts from N-best lists.
New nbest-rover script allows ROVER-style combination of hypotheses from multiple N-best lists.
New rescore-decipher -norescore option, to reformat N-best lists without LM rescoring.
Fixed bugs related to missing <s> and </s> in change-lm-vocab and make-ngram-pfsg.
Significant speedups in LMs involving dynamic programming (HiddenNgram, DFNgram, HMMofNgrams) when interpolating with other models or running in "ngram -debug 2" mode.
Allow absolute discounting on fractional counts, for more effective construction of models from fractional counts.
Added ngram-merge -float-counts option, and allow "-" (stdin) as input file.
ngram-count ensures <s> unigram (with prob 0) is defined to avoid breaking other programs.
Added make-abs-discount script to compute absolute discounting constants from Good-Turing statistics.
compute-sclite and compare-sclite now take -multiwords option to split compound words prior to scoring.
Changed option handling so that unsigned option arguments are forced to be non-negative.
Added Map2 (2D Map) class to libdstruct.
Much better string hash function (borrowed from Tcl).
New man pages: training-scripts(1), lm-scripts(1), ppl-scripts(1), pfsg-scripts(1), nbest-scripts(1), lm-format(5), classes-format(5), pfsg-format(5), nbest-format(5).

1.0.1 12 July 2000

Functionality:

wordError() and nbest-lattice -dump-errors now also output the location of deletions in the alignment (NOTE: possible code incompatibility).

New reverse-ngram-counts script.

Bug fixes:

Workarounds for shortcomings in Linux gcc, math library, and linker.

make-ngram-pfsg: don't ignore bigram states with zero BOW (bugfix).

nbest-rover: fixed problem with handling of + lines.

1.1 21 May 2001

Functionality:

HiddenNgram class generalized to deal with disfluency-type events that manipulate the N-gram context.

rescore-reweight script now accepts additional score directories (and associated score weights) for combination of an arbitrary number of knowledge sources.

Enhanced rescore-decipher functionality:
- Option -lm-only to produce output containing LM scores only.
- Option -pretty to perform word mapping on the fly.
- Warn about and handle LM scores that are NaN.

New class VocabMultiMap, implementing dictionary-style mappings of words to strings from another vocabulary.

Added support for pronunciation-based word alignments in WordMesh and nbest-lattice -use-mesh.

Added nbest-lattice -keep-noise option to preserve pauses and noises in alignments.

Support for multiwords:
- make-multiword-pfsg expands PFSGs to use multiwords (using AT&T FSM tools).
- multi-ngram expands N-gram LM to include multiwords.

Added support for Decipher Intlog scaled log probabilities.

Added ngram -seed option to initialize random sentence generation (contributed by Eric Fosler).

New add-pauses-to-pfsg pause= and version= options to allow generation of Nuance-compatible PFSGs (see man page for details).

The NBest class and scripts handle NBestList2.0 format containing phone and/or state backtraces (by ignoring them).

Added Amoeba search option to nbest-optimize (contributed by Dimitra Vergyri).

Added standard 1-best optimization mode to nbest-optimize.

wlat-to-pfsg script now also processes confusion networks output by nbest-lattice -use-mesh.

Bug fixes:

ngram -decipher-nobackoff now applies to the -lm ngram as well if option -decipher is also specified.

ngram -expand-classes no longer dumps core when handling "context-free" class expansions (though those aren't supported).

gawk path in scripts is now adjusted prior to installation (/usr/bin/gawk for Linux, /usr/local/bin/gawk elsewhere).

Fixed numerical problems in nbest-rover/nbest-posteriors.

ngram-count -float-counts behaved differently from equivalent integer-count estimation; both integer and float counts now use the same estimation code.

Reduced memory requirements of nbest-optimize by about 25%.

Minor changes for gcc-2.95.3.

1.1.1 20 July 2001

Functionality:

WordMesh: new interface to record reference word string in alignment.

nbest-lattice: confusion networks can now record reference words if specified with -reference, and are preserved by -write/-read.

replace-words-with-classes now has option to process ngram count files (have_counts=1).

merge-nbest: new utility to merge N-best hyps from multiple lists.

wlat-stats: new utility to compute statistics of word posterior lattices.

Bug fixes:

GT discounting: fixed anomaly due to different floating point precision on x86 platforms.

anti-ngram(1): documented options previously omitted.

WordMesh: reading/writing of confusion networks now preserves total posterior mass.

Changed the hypothesis alignment order in nbest-optimize to be more compatible with decoding in nbest-lattice: first align nbest hyps in order of decreasing (initial) scores, then align reference. nbest-optimize -no-reorder keeps the old behavior (with references anchoring the alignment). All scores and initial lambdas are now used to compute initial posterior hyp probabilities to guide the hypothesis alignment; thus, it now makes sense to restart an optimization with partially optimized weights to revise the alignments.

nbest-optimize now warns about missing or incomplete score files.

Fixed a memory access error in nbest-optimize -1best.

Fixed weight normalization in nbest-optimize when first element is 0.

Miscellaneous fixes for compile under RH Linux 7.0.

1.2 20 November 2001

Functionality:

nbest-lattice -dictionary allows word alignments to be guided by dictionary pronunciations.

nbest-lattice -use-mesh -record-hyps records the rank of N-best hyps contributing to each word hypothesis in the confusion network.

nbest-lattice -no-rescore and -decipher-format options make it more convenient as an N-best format conversion tool.

VocabDistance: new class and subclasses to represent distance metrics (e.g., phonetic distance) over vocabularies.

WordMesh: output word hyps in order of decreasing posteriors.

WordMesh: reading/writing of confusion networks now includes hyp IDs from alignment.

NBest/MultiAlign/WordMesh: support for keeping extra word-level information (NBestWordInfo).

nbest-lattice: unified single and multiple file processing. New option -write-dir to write multiple output lattices. New option -refs to supply multiple references. Options -nbest-errors and -lattice-errors are replaced by switches -nbest-error/-lattice-error, in conjunction with -references/-refs. Outputs are now prefixed by utterance IDs when processing multiple files.

nbest-lattice -nbest-backtrace enables processing of backtrace information from N-best lists; combined with -use-mesh this produces sausages that contain word-level scores and alignment information, as well as phone backtraces (see new wlat-format(5) man page).

wlat-stats script now also computes error statistics when processing confusion networks with references.

nbest-rover now handles N-best lists in Decipher format.

hidden-ngram and disambig: new option -fw-only to use only forward probabilities for posterior computation.

rescore-decipher -filter option to apply textual rewriting filters to hypotheses before rescoring.

segment-nbest -write-nbest-dir option for dumping rescored N-best lists to a directory instead of to stdout.

segment-nbest -start-tag and -end-tag options to insert tags at margins of N-best hyps.

Bug fixes:

WordMesh: computation of deletion costs using a dictionary distance was completely bogus (only affected undocumented nbest-lattice -dictionary option).

nbest-lattice: correctly process -nbest-files using -dictionary in alignment.

nbest-rover: fixed to work on Linux

hidden-ngram: don't abort when an event posterior is 0.

hidden-ngram: avoid abort when noevent occurs in -hidden-vocab list.

segment-nbest: now correctly uses ngram contexts longer than trigram.

segment-nbest: optimized -bias 0 case by disallowing sentence boundary states altogether.

multi-ngram -prune-unseen-ngrams prevents insertion of multiword N-grams whose component N-grams were not in the original model.

ngram: fixed computation of mixture lambda for second LM when three or more models are interpolated.

nbest-posterior (and thus nbest-rover) no longer split multiwords by themselves. To split multiwords with nbest-rover, append the -multiwords option to the argument list, which is passed on to nbest-lattice to achieve the desired effect.

ngram -renorm now applies BEFORE class expansion or pruning of model (in case input model is unnormalized).

make-nbest-pfsg bug involving transition into final node fixed.

Minor script changes to avoid warnings with gawk 3.1.0.

1.3 11 February 2002

Functionality:

Trellis class, disambig and hidden-ngram tools: added support for N-best decoding (contributed by Anand Venkataraman).

MultiwordLM wrapper LM class as a convenient way to split multiwords prior to LM evaluation.

New MultiwordVocab class to support MultiwordLM.

Added ngram -multiwords option (based on MultiwordLM wrapper).

Added support for Chen & Goodman's Modified Kneser-Ney smoothing and interpolated backoff estimates. See ngram-count options -kndiscount[1-6], -kn[1-6], and -interpolate[1-6].

New library and tool for lattice manipulation: lattice-tool.

New nbest-mix -set-am-scores and -set-lm-scores options. These allow setting either the AM or the LM scores in the N-best output to simulate the combined posteriors, while preserving the other scores.

Added some regression tests (test/ subdirectory).

Support for Windows via CYGWIN porting layer (MACHINE_TYPE=cygwin). See doc/README.windows for details.
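
For readers unfamiliar with the Modified Kneser-Ney entry above, the discount constants in Chen & Goodman's formulation are closed-form functions of the counts-of-counts. The following Python sketch illustrates those formulas only; it is not taken from ngram-count, and the function name and argument layout are hypothetical.

```python
def modified_kn_discounts(n1, n2, n3, n4):
    """Return (D1, D2, D3+) from counts-of-counts n1..n4.

    n_k is the number of distinct N-grams seen exactly k times.
    Formulas follow Chen & Goodman's modified Kneser-Ney estimate.
    """
    if min(n1, n2, n3) <= 0:
        # The closed-form estimate divides by n1, n2 and n3.
        raise ValueError("n1, n2 and n3 must be positive")
    y = n1 / (n1 + 2.0 * n2)
    d1 = 1.0 - 2.0 * y * n2 / n1      # discount for N-grams seen once
    d2 = 2.0 - 3.0 * y * n3 / n2      # discount for N-grams seen twice
    d3plus = 3.0 - 4.0 * y * n4 / n3  # discount for counts of three or more
    return d1, d2, d3plus
```

Sparse data can leave one of the counts-of-counts at zero, which is why the sketch refuses to proceed rather than return a meaningless discount.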

Bug fixes:

Trellis: deallocate old trellis nodes on demand in init(), rather than preemptively in clear(). Greatly speeds up forward computation for trellis-based LMs (e.g., ClassNgram).

Textstats: fix to handle zero denominator in ppl computation.

disambig: fixed off-by-one error indexing into trellis.

Miscellaneous small fixes for compilation and operation under Windows (using the CYGWIN environment).

Warning: See doc/README.x86 about a gcc compiler bug that might affect you on Intel platforms.

1.3.1 25 June 2002

Functionality:

nbest-optimize -write-rover-control option conveniently dumps a control file for nbest-rover that encodes the optimized parameters.

New regression tests for nbest-rover (i.e., nbest-lattice) and nbest-optimize.

nbest-posteriors, combine-acoustic-scores now all handle and preserve Decipher N-best formats. This allows nbest-rover to generate sausages with backtrace information if input N-best lists contain it (using -nbest-backtrace option).

New tool nbest-pron-score for computing pronunciation and pause LM scores from N-best hypotheses.

Added disambig -totals option to compute total string probabilities (same as in hidden-ngram).

reverse-lm: simple filter to reverse a bigram backoff LM.

lattice-tool -collapse-same-words reduces lattices by merging all nodes with identical words (but also creates new paths in lattice).

nbest-lattice -prime-with-refs option uses reference strings to improve sausage alignment.

compute-best-sentence-mix: new script to optimize sentence-level interpolation of LMs.

nbest-lattice -lattice-files option to align multiple word lattices; currently only works with -use-mesh (sausages).

hidden-ngram now supports mixture and class N-gram LMs.

New class SimpleClassNgram, a more efficient implementation of ClassNgram where each word is assumed to belong to at most one class and class expansions are exactly one word long. Enabled by -simple-classes switch in ngram, lattice-tool, and hidden-ngram.

ngram -counts now handles escaped input lines and LM state change directives embedded in the input.

New tool nbest-pron-score for scoring pronunciations and pauses in N-best hypotheses.

NgramStats::parseNgram() new function to parse N-gram counts from a character string.

LM::pplCountsFile() new function to evaluate LM on counts read from a file.

Bug fixes:

make-ngram-pfsg is no longer limited to trigram models.

Avoid NaN values in disambig and hidden-ngram, in cases where lmw or mapw are zero and the corresponding log probabilities are -Infinity.

Avoid numerical problems in N-best posterior computation by using AddLogP() to compute normalizer.

anti-ngram no longer requires -refs argument with -all-ngrams.

Fixed bug removing noise from N-best lists with backtrace.

Code fixes for clean compiles with gcc 3.x.

nbest-rover more efficient by using a single invocation of nbest-lattice for all input N-best lists.

ClassNgram: fixed handling of words that appear as members of a class with zero probability, or have zero membership probability.

nbest-lattice -record-hyps now outputs hyp ids according to the original N-best order, rather than the sorted one.

make-hiddens-lm now gives proper unigram probability to hidden-S tag.

Compute acoustic scores in Decipher N-best-2 format by subtracting token LM scores from total score. This deals correctly with cases where the total scores have been adjusted by summing merged hyps, and are no longer the sum of all AC and LM word scores.

Gawk scripts that test for alphabetic or lowercase characters are more portable and handle non-ascii and multibyte characters.
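
The normalizer fix above relies on adding probabilities without leaving the log domain. The following Python sketch shows the general technique, assuming base-10 log scores; add_log10 and log10_posteriors are hypothetical names used for illustration and are not the AddLogP() implementation.

```python
import math

LOG_ZERO = float("-inf")  # log of probability zero


def add_log10(x, y):
    """Return log10(10**x + 10**y) without exponentiating large magnitudes."""
    if x < y:
        x, y = y, x  # make x the larger term
    if y == LOG_ZERO:
        return x     # adding probability zero changes nothing
    return x + math.log10(1.0 + 10.0 ** (y - x))


def log10_posteriors(scores):
    """Normalize log10 hypothesis scores into log10 posteriors."""
    normalizer = LOG_ZERO
    for s in scores:
        normalizer = add_log10(normalizer, s)
    if normalizer == LOG_ZERO:
        raise ValueError("all hypotheses have zero probability")
    return [s - normalizer for s in scores]
```

Exponentiating a score such as -1000 directly would underflow to zero and make the normalizer useless; factoring out the larger term keeps the exponent small.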

The package now includes a paper on SRILM, to appear in ICSLP-2002, that gives an overview of the software and its design (doc/paper.ps).

1.3.2 3 September 2002

New functionality:

Added ngram and ngram-count -nonevents option to specify a subset of words that are to be non-events, i.e., tokens that can only occur in contexts (such as <s>).

Extended ngram-count discounting options for up to 9-grams.

Added support in Vocab and Ngram classes for processing meta-counts (counts-of-counts).

Added ngram-count -meta-tag and -kn-counts-modified options to support make-big-lm.

Added ngram-count -read-with-mincounts flag to suppress counts below cutoff thresholds at reading time. This dramatically lowers memory consumption, and speeds up make-big-lm operation (which used to use a gawk script for the same purpose).

Added option to specify vocabulary to add-pauses-to-pfsg for cases where heuristics fail.

lattice-tool can now handle arbitrary order LMs for expanding lattices. The old trigram expansion algorithm is still available with -old-expansion; the compact trigram algorithm is unchanged with -compact-expansion.

To better support lattice expansion, two new functions have been added to the LM interface: contextID() takes an optional word argument, to compute the context needed to predict a specific word, and contextBOW() is a new interface to compute the backoff weight associated with truncating a history.

Added makefile support to generate executable versions that use "compact" data structures. See item 9 in INSTALL for details, and doc/time-space-tradeoff for a simple benchmark result.

Bug fixes:

Convert pseudo-log(0) value (-99) in DARPA backoff models back to true log(0) on reading. This ensures that non-event words in the input are treated as zeroprobs (by the perplexity computation and otherwise).

Avoid NaN floating point results in N-best rescoring and nbest-optimize, by handling 0 * log(0) more carefully.

Handle -Inf AM and LM scores in SRILM N-best format.

make-big-lm was reworked to support KN in addition to GT discounting. Warning: the modified lower-order counts for KN are created using merge-batch-counts and can get almost as big as the original counts. Beware of the additional disk space and run time requirement!

Clear out old parameters before reading or estimating N-gram models.

Reading in new class definitions into ClassNgram object now deletes old definitions (unless classes file is empty).

Destructors for Ngram and ClassNgram now free N-gram and class definition memory.

nbest-pron-score: avoid core dump when pronunciation information is missing from N-best list.

make-ngram-pfsg: fixed generation of unigram PFSGs.

Avoid use of toupper() in add-pauses-to-pfsg.

Handle ngram-count -order 0 and print warning.

Avoid using zcat in scripts since it behaves differently on different systems and depending on PATH setting.

nbest-lattice and nbest-optimize no longer strip a filename part following '.' to derive utterance ids; only known file suffixes are removed.

Fixed bugs in member declarations that were preventing TaggedVocab, TaggedNgramStats, and StopNgramStats from working correctly.

compute-sclite now ignores utterances with a reference of "ignore_time_segment_in_scoring", consistent with NIST STM scoring.

Vocab.h now defines SArray_compareKey() for strings over VocabIndex, allowing use as keys in sorted arrays.

ClassNgram now uses the processed words as the context after an OOV. This works better when the input contains context cue tags.

i386-solaris platform was not being detected by machine-type script.
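
The 0 * log(0) fix above addresses a standard floating-point pitfall: a zero weight multiplied by a score of negative infinity yields NaN rather than zero. The following Python sketch illustrates one way to guard against it; weighted_score is a hypothetical helper, not the code used in the N-best rescoring classes.

```python
def weighted_score(weights, scores):
    """Combine log scores with weights, treating 0 * log(0) as 0.

    A plain sum of w * s would produce NaN whenever a weight of zero
    meets a score of -infinity, so zero-weight terms are skipped.
    """
    total = 0.0
    for w, s in zip(weights, scores):
        if w == 0.0:
            continue  # a knowledge source with zero weight contributes nothing
        total += w * s
    return total
```

A term with nonzero weight and a score of negative infinity still drives the total to negative infinity, which is the intended result for an impossible hypothesis.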

1.3.3 2 March 2003

New functionality:

Increased maximum number of interpolated LMs in ngram, hidden-ngram, and lattice-tool to 10.

ngram now computes static interpolation (N-gram merging) of up to 10 input LMs (consistent with handling of dynamic interpolation).

ngram and lattice-tool -limit-vocab option limits LM reading to those parameters that pertain to words specified by -vocab. The LM::read() function got an optional second argument for this purpose. ngram -limit-vocab -renorm now effectively does the same as the change-lm-vocab script. However, the main purpose of -limit-vocab is to save memory by discarding N-grams that are not relevant to a test set.

rescore-decipher -limit-vocab precomputes the vocabulary used by N-best lists and invokes ngram -limit-vocab to allow rescoring with very large models on machines with little memory.

Ngram::mixProbs() now has version that destructively merges an Ngram into an existing model. ngram -mix-lm now uses this version, instead of the old, non-destructive one, thereby achieving considerable time and space savings (only two models, rather than 3, have to be kept in memory at a time).

ngram-count and ngram -map-unk option, to change the "unknown" word token string.

compute-sclite, compare-sclite now understand multiple -S options to specify intersections of several utterance subsets for scoring.

make-batch-counts now ignores lines in input file list that start with # (allowing comments in the file list).

Added replace-words-with-classes partial=1 option to prevent multi-word replacements that include multiple whitespace characters (i.e., "a b" is only replaced with a single space between the words).

New LM script: sort-lm, reorders N-grams lexicographically, as required by some other software (e.g., Sphinx3, pointed out by Mikko Kurimo <mikkok@james.hut.fi>).

New training script: reverse-text, reverses word order in text file.

New pfsg script: pfsg-vocab, extracts vocabulary used in PFSGs.

Bug fixes:

disambig and hidden-ngram -keep-unk now also causes LM to be treated as open-vocabulary.

HiddenNgram class (debug level 2) was omitting the event after the last word from the Viterbi backtrace.

ngram -expand-classes was including -pau- word in expanded LM.

Made backoff computation in Ngram::wordProbBO() more efficient, avoiding multiple lookups in the context trie. Gives about a 30% speedup in ngram -debug 3 -ppl.

ngram -lm reading is faster by about 8% due to a code optimization.

ngram-count -order 2 -kndiscount3 no longer aborts with an error. The -order option effectively limits the discounting parameters computed, so that the model order can be changed without having to adjust the smoothing options.

make-big-lm -trust-totals option is ignored with KN discounting; they don't work well together.

make-big-lm now checks that input counts files are not stdin.

Reading N-best lists in Decipher format now sets the number-of-words score, so that weight rescoring, optimization etc. can use them.

ngram-count normalizes the N-gram probabilities for a context to 1 if the backoff distribution for that context has probability mass 0. The latter can happen e.g. if all N-grams for a context have been observed and received discounted probabilities. The fix ensures that the overall distribution is normalized in this case.

rescore-reweight now accepts Decipher N-best lists.

nbest-posteriors and nbest-rover now handle Decipher version 2 N-best lists better (allowing LM and WT weights to be applied).

Initialize locale in all top-level programs. disambig, hidden-ngram, segment, and segment-nbest were missing it, causing potential problems with non-ASCII characters.

nbest-lattice -write-vocab option to find vocabulary used in N-best list.

nbest-pron-score now uses idFromFilename() function to avoid over-truncating filenames when inferring sentence ids.

Added more strippable filename suffixes in idFromFilename() function.

NBest: correctly read in phone backtraces that are time-reversed.

compute-oov-rate ignores -pau- tokens.

Various N-best scripts now process input directories containing links (rather than plain files) correctly.

Lattice class takes care to limit range of intlog transition probabilities in PFSG output, so as to avoid overflow when converting to bytelog scale.

make-ngram-pfsg removes temporary file (now placed in /tmp) even when killed by signal.

Hidden-event and DF N-gram models are documented in detail in ngram man page.

Test suite result comparisons against reference output now use a script that ignores small numerical discrepancies, so as to produce fewer false alarms.
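
The idea behind the tolerant test-suite comparison mentioned above can be sketched in a few lines of Python. This is only an illustration of the technique, not the script shipped with the test suite; the function names and the tolerance values are assumptions.

```python
import math


def tokens_match(a, b, rel_tol=1e-4, abs_tol=1e-6):
    """Compare two whitespace-delimited tokens, numerically if possible."""
    if a == b:
        return True
    try:
        return math.isclose(float(a), float(b),
                            rel_tol=rel_tol, abs_tol=abs_tol)
    except ValueError:
        return False  # at least one token is not a number, and they differ


def outputs_match(expected, actual):
    """Return True if two text outputs agree up to small numeric differences."""
    exp_lines = expected.splitlines()
    act_lines = actual.splitlines()
    if len(exp_lines) != len(act_lines):
        return False
    for exp_line, act_line in zip(exp_lines, act_lines):
        exp_tokens = exp_line.split()
        act_tokens = act_line.split()
        if len(exp_tokens) != len(act_tokens):
            return False
        if not all(tokens_match(e, a)
                   for e, a in zip(exp_tokens, act_tokens)):
            return False
    return True
```

Comparing token by token keeps non-numeric text exact while letting the last digits of log probabilities and perplexities vary across platforms.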

Portability:

Compiles under MacOS X (MACHINE_TYPE=macosx), thanks to help from wooters@icsi.berkeley.edu and jean-philippe.demoulin@enst.fr.

1.4 14 February 2004

New functionality:

Added support for factored language models, developed by Katrin Kirchhoff and Jeff Bilmes, and implemented by Jeff Bilmes. A new library, libflm.a, and two new tools, fngram-count and fngram are built in the flm/ directory. A conference paper and a technical report are included as documentation in flm/doc/. Questions and bug reports should be directed to bilmes@ee.washington.edu. FLM support has also been integrated into some of the standard tools (ngram and hidden-ngram) and is enabled by the -factored option.

Added support in lattice-tool to read/write and rescore HTK lattices. See lattice-tool man page for details.

The lattice expansion algorithm for general LMs now preserves pause and null nodes. Consequently, lattice-tool no longer eliminates pause and null nodes prior to applying this algorithm, unless -no-pause or -compact-pause was specified.

Implemented a new algorithm to build word meshes (confusion networks, sausages) from lattices, that is faster than the original Mangu et al. method. lattice-tool -posterior-decode uses this to extract 1-best word hypotheses, and lattice-tool -write-mesh allows writing of sausages to file.

The "compact" lattice expansion algorithm that uses backoff nodes (described in Weng et al. 1998) has been generalized to handle LMs of arbitrary order. As before, this algorithm is triggered by lattice-tool -compact-expansion. (To get the old version, which handles only trigrams and produces non-identical results, use lattice-tool -compact-expansion -old-expansion.)

lattice-tool -density allows pruning of lattices to a specified density (in addition to the posterior threshold).

lattice-tool -multi-char option allows designating characters other than underscore as multiword delimiters.

Added a "LatticeLM" class that emulates a language model using the transition probabilities in a lattice. This is useful for debugging and comparing the probabilities assigned by lattices to corresponding LM probabilities. A new option lattice-tool -ppl makes use of this class (analogous to ngram -ppl).

lattice-tool lattice algebra operations (or, concatenate) can now be applied to multiple input lattices, always using the same lattice as second operand.

ngram has enhanced N-best rescoring functionality, allowing multiple input lists to be rescored (-nbest-files, -write-nbest-dir, -decipher-nbest, -no-reorder, -split-multiwords).

rescore-decipher -fast enables a faster rescoring mode that uses only the built-in functions of ngram, thus running much faster.

New option ngram -rescore-ngram to recompute the probabilities in an N-gram model using an arbitrary other LM.

Added original (unmodified) Kneser-Ney discounting (ngram-count -ukndiscountN options). Contributed by Jeff Bilmes.

New disambig -classes option to read vocabulary maps in classes-format(5).

New disambig -write-counts option to output word/class substitution bigram counts (useful to reestimate class membership probabilities).

nbest-pron-score -pause-score-weight creates weighted combination of pronunciation and pause LM scores.

compute-sclite -noperiods option to delete periods from hyps for scoring purposes.

New script empty-sentence-lm to modify existing LM to allow the empty sentence with a given probability.

compute-sclite handles CTM files in RT-03 format.

ngram-class -debug 2 prints the initial word-to-class assignments, so that the entire class tree can be reconstructed from the output.

RefList class has option to read and look up reference words without associated ID strings (indexed by integers).

Enhanced WordMesh and WordLattice classes to have an optional "name" field, used to record utterance ids.

New select-vocab command to implement likelihood-optimizing vocabulary selection from multiple corpora. Contributed by Anand Venkataraman and Wen Wang. See man page for details.

Bug fixes:

ngram avoids reading classes file multiple times if -limit-vocab is not being used (otherwise it is unavoidable, and will lead to errors if the reading is from stdin).

Fixed some bugs in compare-sclite and compute-sclite.

Modified ngram and compute-best-mix so that the latter works with ngram -counts output. ngram -counts now outputs the count values != 1 for each N-gram so that compute-best-mix can take them into account in the optimization.

rescore-reweight and nbest-rover were not handling Decipher N-best lists correctly when additional score directories are given.

nbest-rover -wer disables use of nbest-lattice -use-mesh option, so nbest-rover can be used for old-style word error minimization (or even 1-best rescoring, by also specifying -max-rescore 1).

lattice-tool -ref-file and -ref-list were being ignored when processing only a single input lattice. Fixed so that lattice error can now be computed with either -input-lattice or -input-lattice-list.

Enhanced MultiwordLM class with new contextID() and contextBOW() versions that better reflect the backoff behavior of the wrapped LM class. Makes it much more efficient to use the lattice-tool -multiword option, i.e., expand a multiword lattice with a non-multiword LM.

rescore-decipher -pretty had a bug that caused mapping to be applied to the score fields as well, potentially corrupting the format.

Fixed bugs in mixture lambda computation (ngram, hidden-ngram, lattice-tool), triggered by more than one lambda being zero, or using more than 5 mixtures.

lattice-tool algebra operations used to crash if operand lattices contained NULL nodes.

Non-compressed files ending in .gz can now be read successfully.

Catch a possible 0/0 problem in the Good-Turing discount estimator.

Fixed memory management for strings returned by TaggedVocab::getWord(), thereby avoiding garbled results.

lattice-tool -pre-reduce-iterate and -post-reduce-iterate arguments were not being used to control number of lattice reduction iterations.

Fixed an uninitialized memory bug that could produce random results in posterior probability computation (and hence in lattice pruning).

Fixed a bug in lattice pruning triggered by unnormalized posteriors greater than 1.

Portability:

Fixed some problems compiling with gcc-3.2.2; eliminated compile-time warnings about division by zero in constant definitions.

Rewrote some code to work around limitations and warnings in the Intel C++ compiler. (In return, got compiled code that runs 10-20% faster!) For processor-specific optimizations, use

make MACHINE_TYPE=i686-p4 .

Fixed some script problems that surfaced in latest gawk version.

Fixed some problems compiling with Tcl/Tk-8.4.1.

FreeBSD support (contributed by Zhang Le <ejoy@peoplemail.com.cn>).

Updated Nuance-related features in PFSG scripts and man page.

Note: Integration of FLM support required some changes to the Vocab and Ngram class interface. In particular, several member variables (e.g., Boolean Vocab::unkIndex) have been replaced by virtual member functions that return references to the variables (e.g., Boolean &Vocab::unkIndex()). This requires changes, albeit trivial ones, to any client code that accesses these variables.

1.4.1 9 May 2004

Functionality:

New option lattice-tool -htk-quotes to enable the HTK quoting mechanism that allows whitespace and non-printable characters to be used in word labels. (This is disabled by default since other SRILM tools don't allow such word strings.)

New option lattice-tool -add-refs to add a path corresponding to the reference word string to each lattice.

New option ngram -counts-entropy to compute entropy (log probabilities weighted by joint N-gram probability) from counts.

Bugs fixed:

nbest-lattice could core dump if references were not supplied.

FLM/ProductVocab: fixed problems with mapping of <s> and </s> to factored form.

Lattice algebra operations (or, concatenate) now preserve HTK link information and lattice names.

Fixed LM::contextProb() handling of <s> and other non-event tokens. This also allowed Ngram::computeContextProb() to be eliminated.

LatticeFollowIter iterator no longer takes lookahead parameter; lookahead is unlimited and cycles are avoided by keeping a table of visited nodes. This also greatly speeds up lattice expansion in some cases.

Detect negative discounts in modified Kneser-Ney method, arising from non-monotonic counts-of-counts.

Fixed various debugging output messages in the Lattice class.

Portability:

Matthias Thomae <thomae@ei.tum.de> found that make-ngram-pfsg

(and probably other gawk scripts) may not work correctly with recent versions of gawk unless the environment is set to LC_NUMERIC=C.

1.4.2 19 October 2004

Functionality:

lattice-tool -factored option to handle factored LMs (analogous to ngram and hidden-ngram).

lattice-tool -nbest-decode generates N-best lists from lattices (contributed by Dustin Hillard, University of Washington).

lattice-tool -output-ctm option to generate CTM-formatted 1-best output, either with -viterbi-decode or with -posterior-decode. This requires HTK input lattices containing timemarks.

Added version of WordMesh::minimizeWordError() that returns acoustic information in a NBestWordInfo array, to support the above.

lattice-tool -insert-pause option to insert optional pause nodes in lattices.

lattice-tool -unk will map unknown words to <unk> instead of automatically augmenting the vocabulary (the -map-unk option allows the mapping of unknown words to be customized).

lattice-tool -acoustic-mesh records word times, scores, and phone alignments when confusion networks are built.

lattice-tool -ignore-vocab option to define the set of words that are ignored in LM processing (like pause nodes).

lattice-tool -write-ngrams option to compute expected N-gram counts from lattices.

HTK lattices now support up to three "extra" score fields (x1..x3), which can be used to rescore hypotheses with arbitrary non-standard knowledge sources.

Added support for the "s" key in HTK lattices (used to encode state alignment info).

anti-ngram -min-count option to prune N-grams with expected frequency below specified threshold.

ngram -adapt-marginals and related options to trigger use of unigram marginals adaptation, following Kneser et al. (Eurospeech 97).

New LM class AdaptMarginals to support the above.

nbest-lattice and lattice-tool -hidden-vocab option allows specifying a subvocabulary that should not be aligned with regular words when building confusion networks.

New VocabDistance subclass SubvocabDistance, to support the above.

nbest-optimize -combine-linear and -non-negative options, useful to optimize linear combinations of posterior probability scores.

Bugs fixed:

lattice-tool: Avoid disconnecting lattice in density pruning.

Utility script installation was not working for Cygwin hosts.

ProductNgram::contextID() now returns hash code of context used, instead of zero, and limits context-used length to order-1.

HTK lattice output was omitting wdpenalty value.

Improved collision-prone hash function for VocabIndex arrays.

Documented order of operations in lattice-tool(1).

Fixed excessive /tmp space usage in nbest-rover script, so as to avoid frequent incomplete output with large N-best data as a result of running out of disk space.

Fixed bug in compute-sclite that would garble STM references without the optional 6th field.

Fixed bug in Trie::insert(), which would always set foundP = true, even if a new entry was created.

Preserve Lattice::limitIntlogs flags in lattice algebra operations.

Use sorted node map iteration in lattice-tool expansion algorithms, so that results are not subject to pseudo-random hash table ordering.

HTK lattice output no longer has more nodes/links than input (provided -no-htk-nulls, -htk-scores-on-nodes, or -htk-words-on-nodes are NOT used).

Take default lattice name from input filename, rather than output filename (which may not be defined). However, the embedded names of output lattices from binary lattice operations are derived from the output file name.

Fixed bug in reading of word meshes (confusion networks) introduced in release 1.4.

Fixed a bug in alignments of multiple confusion networks, affecting cases where the inputs have posterior masses != 1.

1.4.3 3 December 2004

Functionality:

Increased the number of extra scores supported in HTK lattices (x1, x2, ... x9).

lattice-tool -nbest-viterbi option to use Viterbi N-best algorithm, which uses less memory (contributed by Jing Zheng).

Added nbest-lattice -output-ctm analogous to lattice-tool.

Make -output-ctm output word posteriors in the confidence field.

Extend the meaning of the nbest-lattice -max-rescore option so that, in lattice mode, it limits the number of hypotheses that are aligned. (The meaning of -max-rescore was previously only defined in N-best rescoring mode.)

Added -version option to all top-level programs.

Bug fixes:

Improved efficiency and duplicate elimination in A-star N-best generation (contributed by Jing Zheng).

Worked around a problem with gawk scripts in Linux handling of /dev/stderr device which can cause a file to be truncated if stderr is redirected to it.

MultiAlign::addWords() was not preserving NBestWordInfo.

Other:

Various small code changes for compilation with gcc 3.4.3.

Maintenance scripts moved to $SRILM/sbin/.

Support for commercial releases excluding third-party code

contributions.

1.4.4 6 May 2005

Functionality:

ngram-count now allows use of -wbdiscount, -kndiscount, etc., without a specified N-gram order, to set the default discounting method for all N-gram orders. As before, this can be overridden by -wbdiscount[1-9], -kndiscount[1-9], etc., for specific N-gram lengths (suggested by Anand).

lattice-tool -keep-pause has additional side-effects if used with -nonevents and -ignore-vocab (making pauses behave like regular words).

lattice-tool -dictionary-align option triggers use of dictionary pronunciations for word mesh alignment (contributed by Dustin Hillard).

New option lattice-tool -nbest-duplicates allows control over the number of duplicate word hypotheses to output (from Dustin Hillard).

Update to the FLM tools from Kevin Duh, to make fngram-count use the -vocab option to limit the vocabulary of the estimated model.

Added nbest-optimize -hidden-vocab option to constrain the alignment of a subvocabulary (analogous to nbest-lattice -hidden-vocab).

wlat-stats computes the posterior expected number of words in the input lattice.

Bug fixes:

ngram -unk maps unknown words in N-best hyps to <unk> instead of adding them to the vocabulary.

lattice-tool: Don't punt when encountering a NULL word node with pronunciation, output a warning instead.

lattice-tool -nbest-decode now uses a double-ended heap data structure, and -nbest-max-stack drops hypotheses from the bottom of the heap instead of the top (contributed by Dustin Hillard).

lattice-tool -nbest-decode now does more thorough duplicate removal (not just adjacent duplicates are removed).

lattice-tool no longer gives an error if input lattice has posteriors specified on nodes (even though they are effectively ignored).

select-vocab: miscellaneous bug fixes from Anand.

nbest-lattice: fixed various bugs with -nbest-backtrace option.

compute-sclite: work around bug in csrfilt.sh -dh affecting waveform names containing hyphens.

Minor tweaks for MacOSX build.

1.4.5 28 August 2005

Functionality:

ngram -debug 0 -ppl now outputs statistics for each input section delimited by escape lines, in addition to overall results (based on a modification by Dustin Hillard). ngram -debug 1 and higher behave as before.

ngram -loglinear-mix implements log-linear mixture LMs.

LoglinearMix: new class to support the above.

VocabMap: added remove(.) method to remove all entries for given source word.

WordMesh: added wordColumn() function to return confusion set at given position (contributed by Dustin).

Lattice: added readMesh() function to read in confusion networks (from Dustin).

lattice-tool -read-mesh allows input in confusion network format to be handled (from Dustin).

nbest-optimize -1best-first implements a heuristic strategy whereby the relative score weights are first optimized in -1best mode, followed by full optimization together with posterior scale.

nbest-optimize -max-time forces search to time out if new best weights aren't found within a certain number of seconds.

New script combine-rover-controls to merge multiple nbest-rover control files for system combination.
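Illustration of the log-linear mixture idea behind ngram -loglinear-mix: each component probability is raised to its weight, the results are multiplied, and the product is renormalized over the vocabulary. This Python sketch is not the LoglinearMix code; the function name, the toy distributions, and the weights are assumptions made for the example.

```python
import math

def loglinear_mix(dists, weights):
    """Combine distributions as p(w) proportional to prod_i p_i(w) ** weight_i.

    dists   -- list of dicts mapping word -> probability (same keys, all > 0)
    weights -- one exponent per distribution (hypothetical values)
    """
    vocab = dists[0].keys()
    # Work in the log domain: weighted sum of log probabilities per word.
    log_scores = {
        w: sum(lam * math.log(d[w]) for d, lam in zip(dists, weights))
        for w in vocab
    }
    # Renormalize so the mixture sums to one over the vocabulary.
    norm = sum(math.exp(s) for s in log_scores.values())
    return {w: math.exp(s) / norm for w, s in log_scores.items()}

# Toy example with two made-up unigram distributions.
p1 = {"a": 0.5, "b": 0.25, "c": 0.25}
p2 = {"a": 0.2, "b": 0.6, "c": 0.2}
mix = loglinear_mix([p1, p2], [0.5, 0.5])
```

Unlike a linear mixture, the result needs an explicit normalization over the vocabulary, which is what makes log-linear mixtures more expensive to evaluate.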

Bug fixes:

disambig clears old map entries when encountering a duplicate definition for a source word.

nbest-optimize: posterior scaling of fixed weights was broken.

WordMesh, nbest-lattice: do better error checking on reading confusion network files, handle numalign and posterior specs out of order.

lattice-tool had a bug in the handling of HTK format lattices that do not contain an explicit specification of initial/final nodes.

Added proper copy constructors and assignment operators for Array, SArray, and LHash classes. This in turn makes the copy constructor for NgramLM and other classes work properly. (Assignment still doesn't work for some higher-level classes because of reference (&) variable members.)

Fixed minor bug in the ngram -skipoovs implementation, found by Alexandre Patry.

Portability:

Port to win32-mingw platform (by Jing Zheng). Doesn't support compressed file i/o, or the -max-time options in nbest-optimize and lattice-tool.

Minor tweaks for compilation with gcc-4.0.1.

Renamed HTKLink class to HTKWordInfo, which is more appropriate and avoids a naming conflict with SRI's Decipher software.

1.4.6 20 January 2006

Functionality:

Added support for reading/writing files compressed with bzip2

(file suffix .bz2). Requires that the bzip2/bunzip2 binaries be installed.

Bug fixes:

Lattice class now creates completely empty lattices (no nodes). This avoids having to first remove a node when reading an actual lattice. Empty lattices can be output, but not read (because at least an initial/final node has to be defined).

lattice-tool -ignore-vocab was not being used in conjunction with -viterbi-decode, -posterior-decode, -collapse-same-words, and lattice error computation. Words to be ignored are now treated the same as -noise-vocab in those operations.

Fixed a bug in lattice expansion whereby backoff weights were dropped at NULL nodes (problem noticed by Teemu Hirsimaki).

Fixed bug in reading of node-specific posterior probabilities in word meshes.

Fixed a bug in lattice-tool -read-mesh, which was not creating sentence initial/final tags on initial/final lattice nodes.

Fixed a bug in the LatticeFollowIter class that could cause incorrect results in LatticeLM (lattice-tool -ppl).

When outputting PFSG lattices in HTK format, map PFSG weights to HTK acoustic scores. (But, as before, LM rescoring discards input PFSG weights and causes the probabilities to be output as LM scores.)

Scale wdpenalty values specified in lattice according to log-base. Also, scale -htk-wdpenalty specified on command line according to -htk-logbase (or default 10).

Correctly handle HTK score output with -htk-logbase 0.

Portability:

Added workaround for compilers that don't support arrays of

non-constant size (such as SunStudio and Visual C++). On these systems, Array will be used instead.

Added a new compilation option "_s" that triggers use of 2-byte integers for vocabulary indices and counts. With compilers that implement __attribute__((packed)) correctly, this causes N-gram counts to use 1/3 less memory than in the default option, at the cost of some limitations in functionality. First, only vocabularies of up to 64k words may be used. Second, only up to 32k counts exceeding 32k may be stored. The latter is typically not a problem because in most natural data the number of very frequent words is small. Unfortunately, gcc does not currently handle __attribute__((packed)) correctly, but Intel's icc does.

Tested on Linux for PowerPC-64bit.

Tested on Linux for x86_64, using gcc.

Minor tweaks for Intel icc 8.0.

Tested on Solaris-x86 using Sun Studio 11 compiler.

Compilation still generates lots of warnings, but the resulting binaries work correctly.

Ported to Microsoft Visual C 7.0 (by Jing Zheng); see doc/README.windows-msvc.

gcc versions older than 3.4.3 are no longer supported, though

they might still work.

1.5.0 31 July 2006

Functionality:

Added support for a binary data format for N-gram backoff models

which speeds up the reading of model files by a factor of 2 for full models, and by an order of magnitude if -limit-vocab is used. Note that the binary format is machine architecture dependent. See the ngram -write-bin-lm option (contributed by Jing Zheng).

disambig now supports Bayesian or standard interpolation of up to 10 LMs, just like ngram and hidden-ngram.

Added disambig -factored option to support factored hidden tag LMs.

Added disambig -escape option to pass information unprocessed to

the output, similar to hidden-ngram.

New utility script: split-tagged-ngrams, see training-scripts(1)

man page.

New function Vocab::checkWords() for more efficient implementation

of the ngram -limit-vocab functionality.

Modified compute-sclite to support scoring of overlapped speech

with asclite program.

New NgramCountLM class implementing a mixture of count-based

maximum-likelihood estimators (aka deleted interpolation aka Jelinek-Mercer smoothing).

ngram-count and ngram -count-lm options to implement deleted

estimation and evaluation of NgramCountLM models. This option is also supported by hidden-ngram, disambig, and lattice-tool.

Added support for ngram counts stored in an indexed directory

structure, based on a format developed by Thorsten Brants for data delivered to LDC by Google. This data format can be used in conjunction with the NgramCountLM class, and may be generated from standard ngram count files using the make-google-ngrams script (see training-scripts(1)).

Added NgramStats::clear() function.

Added the limitVocab option to the NgramStats::read() function.

In conjunction with NgramCountLM, this allows use of arbitrarily large N-gram statistics on limited test sets.

Added ngram-count -limit-vocab option.

Added hidden-ngram -vocab and -limit-vocab options.

Possible incompatibility: the -hidden-vocab wordlist must not contain the noevent word; it is added implicitly.

Added lattice-tool -write-vocab option to extract vocabulary from

lattice files.

Added lattice-tool -init-mesh option to align lattice to preexisting

confusion network.

Added an interface for vocabulary aliasing (name mapping) to

the Vocab class, and the option -vocab-aliases to the programs disambig, hidden-ngram, lattice-tool, nbest-lattice, ngram-count, and ngram. This allows direct use of LMs with slightly mismatched vocabularies relative to some test data. Also, added handling of the -vocab-aliases option to the rescore-decipher script, so that large name mapping files can be subsetted when -limit-vocab is in effect (so that only the relevant portions of an LM are loaded).

disambig now automatically limits LM reading to the words found in

the map file (suggested by Jing Zheng).

hidden-ngram -bayes and -bayes-length options added to give more

control over interpolation.

The default count type is now "unsigned long" instead of "unsigned int". This makes no difference on 32-bit platforms, but on 64-bit platforms it allows the handling of data upwards of 4.3 billion tokens (which would cause integer overflow on 32-bit machines).

For 32-bit platforms, added a compile option "_l", which triggers

use of 64-bit "long long" integers for count storage. This uses the XCount class to avoid needing extra memory for count storage, assuming that large count values will be sparse.
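As background for the NgramCountLM entry above: deleted interpolation (Jelinek-Mercer smoothing) mixes maximum-likelihood estimates of different orders. The Python sketch below shows the idea for a unigram/bigram mixture with a single fixed weight. It is not SRILM's implementation; the function names, the toy data, and the weight 0.7 are assumptions, and in practice the weights would be estimated on held-out data.

```python
from collections import Counter

def train_counts(sentences):
    """Collect unigram and bigram counts from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        for i, w in enumerate(words):
            unigrams[w] += 1
            if i > 0:
                bigrams[(words[i - 1], w)] += 1
    return unigrams, bigrams

def jm_prob(word, prev, unigrams, bigrams, lam=0.7):
    """Interpolate ML bigram and unigram estimates with a fixed weight lam."""
    p_uni = unigrams[word] / sum(unigrams.values())
    # Number of bigram tokens whose history is `prev`.
    context = sum(c for (h, _), c in bigrams.items() if h == prev)
    if context == 0:
        return p_uni  # unseen history: fall back to the unigram estimate
    p_bi = bigrams[(prev, word)] / context
    return lam * p_bi + (1.0 - lam) * p_uni

# Toy corpus (made up for the example).
uni, bi = train_counts([["a", "b", "a", "c"], ["a", "b"]])
p_b_given_a = jm_prob("b", "a", uni, bi)
```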

Bug fixes:

Fixed a bug in the handling of -mix-lm[789] options in ngram,

hidden-ngram and lattice-tool. (With the -bayes option in effect, the -mix-lm6 argument was used for -mix-lm[789].)

Fixed memory management in the XCount implementation, which was

giving incorrect results when compiling with OPTION=_s.

disambig no longer adds <s> and </s> tokens if input already

contains them (consistent with ngram).

lattice-tool -read-mesh was broken in the previous release, now

works again.

lattice-tool -density-prune and -nodes-prune now work without

-posterior-prune being specified.

The -debug option was being ignored with ngram -null .

Fixed a bug in Vocab::remove(VocabString) that could be triggered by interactions between ngram -vocab and -vocab-aliases.

Tweaks to MACHINE_TYPE=msvc compilation. Updated documentation in doc/README.windows-cygwin and doc/README.windows-msvc.

Tweaked compiler flags for Solaris to handle files larger than 2^31.

Prevent possible NaN probabilities in ClassNgram.

Fixed a problem in make-ngram-pfsg triggered by a word named "BO".

Support long int key values in data structures.

rescore-decipher -filter option now works correctly in conjunction

with -limit-vocab.

1.5.1 20 November 2006

Functionality:

ngram-count -write-binary is a new option to create binary count

files, which load much faster. They are recognized automatically by ngram-count -read, and can be used in count-based LMs.

Revised binary backoff LM format (ngram -write-bin-lm) to use only

a single data file and be machine-independent and somewhat more compact. Reading the 1.5.0 binary format is still supported, but not writing it.

Added lattice-tool -bayes and -bayes-scale options for compatibility

with ngram and other programs.

New lattice-tool -write-ngram-index option to generate an index of

N-gram occurrences in a lattice.

New lattice-tool -multiword-dictionary option enables accurate

handling of acoustic information (timestamps, pronunciations) when the -split-multiwords option is used (contributed by Dustin Hillard).

New nbest-optimize -insertion-weight and -word-weights options to

implement weighted forms of word error optimization.

New option make-ngram-pfsg no_empty_bo=1 to disallow an empty (null)

path through the PFSG via the unigram backoff.

New script get-unigram-probs to extract unigram probabilities from

an LM file.

Bug fixes:

Enabled large-file (64bit offsets) handling for Linux 32bit

compilation.

Fixed utility and test scripts to support platforms that don't

support compressed file I/O. Check test/README for instructions.

Fixed bug in compute-sclite that could lead to failure if

waveform names contain hyphens, or sort differently after mapping to lowercase.

Fixed another bug in compute-sclite that was preventing

compare-sclite from working.

Fixed a typo-bug in Ngram::estimate that could cause problems in

handling discounting errors, but in practice seems to have been harmless (from Federico Cesari).

Improved MSVC portability:
- fixed header file usage
- enabled binary file i/o for binary LMs
- fixed miscellaneous compiler warnings
- simplified build (see doc/README.windows-msvc)
- workaround in WordMesh.cc to avoid a compiler bug (from Federico Cesari).

Fixed win32 (Windows gcc, not cygwin) build.

1.5.2 6 March 2007

Functionality:

Support binary LM formats (based on Ngram binary format) for most

LM classes.

New lattice-tool -htk-logzero option to set a dummy score to

replace zero scores found in HTK lattices.

Bug fixes:

Make sure Google ngrams can be read in both compressed and

uncompressed format if platform supports both.

Make sure the file pointer is updated when reading binary Ngram LM.

This enables reading multiple LMs from one file, and avoids errors reading binary class-LMs.

Avoid NaN values when a lattice score is infinity and the

corresponding scale factor is 0 (the score is ignored in that case).

Avoid degenerate decoding results if lattice hypotheses contain

-infinity scores. (Effectively, -infinity is replaced by a large negative log score, thus allowing the decoder to rank hypotheses based on their non-infinity components.)

Updated lattice-tool man page to clarify the interaction of

LM rescoring and lattice decoding.
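The two fixes above concerning infinite lattice scores can be illustrated with a small Python sketch. In IEEE floating point, infinity times zero is NaN, so a zero-weighted infinite score must be skipped rather than multiplied; and replacing -infinity by a large finite value keeps hypotheses comparable. This is not lattice-tool's code; combined_score and the constant LOG_ZERO are hypothetical names and values chosen for the example.

```python
import math

LOG_ZERO = -1.0e10  # hypothetical stand-in for "a large negative log score"

def combined_score(scores, scales):
    """Weighted sum of log scores that avoids NaN and degenerate ranking."""
    total = 0.0
    for score, scale in zip(scores, scales):
        if scale == 0.0:
            continue  # score is ignored; also avoids inf * 0 = NaN
        if math.isinf(score) and score < 0:
            score = LOG_ZERO  # keep the sum finite so other terms still count
        total += scale * score
    return total

naive = float("-inf") * 0.0                            # NaN
safe = combined_score([float("-inf"), -2.0], [0.0, 1.0])
better = combined_score([float("-inf"), -1.0], [1.0, 1.0])
worse = combined_score([float("-inf"), -3.0], [1.0, 1.0])
```

With the replacement in place, the two hypotheses in the last lines can still be ranked by their finite components, whereas with true -infinity both sums would be equal.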

Portability:

Added configuration for Solaris amd64 platform with

Sun C compiler (amd64-solaris_spro).

Updated instructions for MSVC build (see doc/README.windows-msvc), based on input from Mike Frandsen. Merge MSVC .manifest files into binary before installation.

1.5.3 28 July 2007

Functionality:

New ngram-count -write-binary-lm option to output LM in binary format

(avoids the need to dump ascii format first, and then convert to binary using ngram tool).

New make-google-ngrams yahoo=1 option to read Yahoo ngram corpus

(which needs to be sorted first, however).

New make-big-lm -ngram-filter option to pipe input counts through

an arbitrary filter program (e.g., for format conversion).

The make-kn-discount utility will now try to estimate missing

counts-of-counts based on their global statistics, using an empirical law: log f(k) - log f(k+1) = C / k for some constant C. Note this functionality is not implemented in the C++ code for KN discounting. Therefore, it is only available when building LMs with make-big-lm.

New scripts tolower-ngram-counts and uniq-ngram-counts to help

manipulate counts files.

New option ngram-count -write-vocab-index (for debugging).

Vocab.h: Increased maxWordLength constant from 256 to 1024.

Trie class can now initialize root node size with optional constructor

argument (similar to other container classes).

LHash and SArray classes have a new function to preallocate space

following construction (but before any data is inserted).

The platform "i686-p4" has been renamed "i686-icc" (Linux x86 with

Intel compiler) for consistency.
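The empirical law quoted in the make-kn-discount entry above, log f(k) - log f(k+1) = C / k, can be used to fill in a missing count-of-counts roughly as follows. This Python sketch is only one possible reading of that law, not the procedure used by the make-kn-discount script; the averaging used to estimate C, the function names, and the sample values are assumptions.

```python
import math

def estimate_c(counts_of_counts):
    """Estimate C by averaging k * (log f(k) - log f(k+1)) over known pairs."""
    estimates = []
    for k in sorted(counts_of_counts):
        if k + 1 in counts_of_counts:
            fk, fk1 = counts_of_counts[k], counts_of_counts[k + 1]
            estimates.append(k * (math.log(fk) - math.log(fk1)))
    if not estimates:
        raise ValueError("need at least one adjacent pair f(k), f(k+1)")
    return sum(estimates) / len(estimates)

def extrapolate_next(counts_of_counts, k, c):
    """Predict f(k+1) from f(k) using f(k+1) = f(k) * exp(-C / k)."""
    return counts_of_counts[k] * math.exp(-c / k)

# Made-up counts-of-counts where f(4) is missing.
f = {1: 1000.0, 2: 400.0, 3: 250.0}
c_hat = estimate_c(f)
f4 = extrapolate_next(f, 3, c_hat)
```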

Bugs:

Fixed a buffer overrun problem triggered by nbest rescoring of

empty hypotheses.

Fixed problem in compute-sclite with extraction of speaker labels

from ctm files.

NBest class (affecting nbest-pron-score): strip Decipher-specific

phone diacritic labels separated by underscores from pronunciation strings.

Fixed memory leak in Trie::removeTrie(). This was causing a leak

in NgramLM deallocation.

Fixed a performance bug which caused the building of unigram

hash tables to have quadratic time complexity (due to an unfortunate interaction between hash table iterators and hash functions).

Made make-big-lm detect missing -read option and print usage message.

Also, handles degenerate -kndiscount with -order 1 now.

Workaround for icc compiler error: optimization disabled for some

files when using MACHINE_TYPE=i686-m64-icc.

1.5.4 2 November 2007

Functionality:

New option ngram-count -addsmooth for additive smoothing.

A corresponding new discounting subclass "AddSmooth" is defined in Discount.h.

New option ngram -server-port to start a "probability server"

(based on a contribution by Elad Dinur).

WordLattice: print lattice name in warning messages.

lattice-tool -keep-unk option to preserve labels of OOV words in

LM rescoring (currently works only for HTK lattices).

New option nbest-optimize -anti-refs and -anti-ref-weight to

decorrelate errors with another set of hypotheses.

New support in nbest-optimize for BLEU optimization and Powell search

(from Jing Zheng).

New option ngram-class -save-maxclasses to start the saving of intermediate results when a specified number of classes is reached (suggested by Shlomo Wavrow and Mats Svenson).
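For reference, the textbook form of additive smoothing (the method named in the ngram-count -addsmooth entry above) adds a constant delta to every count in a context. The Python sketch below shows that textbook form only; whether it matches the AddSmooth class in every detail is not stated here, and the function name and toy counts are assumptions.

```python
def add_smooth_prob(word, context_counts, vocab_size, delta=1.0):
    """Textbook additive smoothing: (c(w) + delta) / (n + delta * |V|).

    context_counts -- dict mapping word -> count observed after some history
    vocab_size     -- number of word types that may follow the history
    """
    total = sum(context_counts.values())
    return (context_counts.get(word, 0) + delta) / (total + delta * vocab_size)

# Toy counts for one history, with a 4-word vocabulary {a, b, c, d}.
counts = {"a": 3, "b": 1}
p_seen = add_smooth_prob("a", counts, vocab_size=4)
p_unseen = add_smooth_prob("d", counts, vocab_size=4)
```

Unseen words receive a small non-zero probability, at the expense of the observed ones.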

Bugs:

Fixed incorrect reference output for test "nbest-rover-acoustic".

Fixed a possible problem with tests "ngram-class" and

"ngram-count-lm-limit-vocab" in non-C locales.

nbest-lattice: Avoid aligning reference words with -dump-errors or

-wer, which would cause crash because no lattice is being generated internally.

make-batch-counts, merge-batch-counts: be more portable by dynamically

finding the right options to use with xargs.

add-pauses-to-pfsg: Avoid using a regular expression construct that

causes a gawk error in UTF-8 locales. However, to ensure this works correctly a gawk version of 3.1.5 should be used. See note in doc/README.linux. If the test "make-ngram-pfsg" fails a workaround is to set LANG=C or LANG=en_US and avoid UTF-8.

Fixed an uninitialized member variable in the unary constructor for class File, which was causing garbage to be returned on the first getline().

common/Makefile.machine.macos: Updated Tcl linking instructions

(from Chuck Wooters).

Makefile: exit immediately if any of the subdirectories result in

build errors.

1.5.5 6 November 2007

Bug fixes:

Fixed Makefile problem in binaries depending on libraries that was

preventing executables being generated on some platforms.

Fixed a compilation problem with MSVC for nbest-optimize.

Use MSVC _getpid() in ngram -generate random seed initialization.

1.5.6 2 January 2008

Functionality:

New ngram -use-server option to run the client side of a network LM

server as implemented by ngram -server-port. Optionally, probabilities may be cached in the client (option -cache-served-ngrams). Mixtures of one or more network and file-based LMs are also possible.

Likewise, disambig, hidden-ngram, and lattice-tool understand the -use-server option.

New LMClient class to implement the above (a stub LM subclass that

queries a server for LM probabilities).

ngram -server-port now behaves like a true server daemon: it handles

multiple simultaneous or sequential clients, and never exits (unless killed). The number of simultaneous clients may be limited with the -server-maxclients option.

Support for 7-zip compressed files (suggested by Alexy Khrabrov).

lattice-tool -split-multiwords will now print a warning message

about multiwords that were not split because their LM probability was non-zero.

LoglinearMix LM class supports n-way mixtures directly, giving more

efficient implementation for n > 2 than recursive object construction in ngram (contributed by Tanel Alumae).

Bug fixes:

MultiwordLM now implicitly adds all words to the vocabulary, so that

previously unseen multiwords get split. This has the side effect that OOVs will appear as zeroprob words.

Documentation:

The doc/FAQ file has been expanded and reformatted as a man page. It can be viewed with "man srilm-faq" or online at http://www.speech.sri.com/projects/srilm/manpages/srilm-faq.html . The major content additions are questions about the build process, how to build a "Google N-gram LM", smoothing issues, and OOV-handling (the latter by Deniz Yuret). Corrections and additions to this document are most welcome!

A new manual page ngram-discount(7) gives a detailed overview of

smoothing methods found in SRILM (contributed by Deniz Yuret).

The conversion of man pages to html has been enhanced to better

handle code samples and nested itemized lists.

1.5.7 14 October 2008

Functionality:

make-big-lm -text option allows building of LMs that only contain

N-gram contexts that are needed for a given test set, thus saving space.

ngram-count -intersect option allows reading of counts to be

restricted to an N-gram subset.

NgramStats added a Boolean switch "intersect" and a method

setCount(), used for implementing the above.

Allow changing the character used to compound multiwords, using the

new option -multi-char with ngram, anti-ngram, nbest-lattice, nbest-optimize, nbest-pron-score, and several of the nbest-scripts.

New options -no-sos and -no-eos for ngram-count and ngram tools,

to control the insertion of <s> and </s> tokens around sentences.

New lattice-tool -no-expansion option to decode a lattice with a

new LM without first expanding the lattice (contributed by Jing Zheng).

New CachedMem mix-in class to implement a caching memory allocator

(contributed by Jing Zheng).

Added lattice-tool -print-sent-tags option to preserve <s> and </s>

tags in lattice output format, instead of mapping them to null nodes.

Documentation:

Added redirecting http links to non-SRILM program documentation in manual pages.

Portability:

Removed SRI-specific paths etc. from common/Makefile.machine.* .

Added a mechanism that allows site-specific customizations to be recorded in common/Makefile.site.$MACHINE_TYPE to override definitions in common/Makefile.machine.$MACHINE_TYPE, without a need to change the latter.

Bug fixes:

Always output the elements of binary count files and ngram LMs in index-sorted order (same as the _c program version). This avoids poor performance when reading the data back in.

Fixed LMClient.h so it compiles on win32 and msvc platforms (even though it still doesn't do anything, since Unix sockets are not supported).

Process ngram-count -writeN options after applying count smoothing, so that the effect of any count modifications (e.g., by KN) is seen, and consistent with the -write option.

Fixed the timestamps on initial and final nodes of lattice-tool -operation or (bug found by gaojie@hccl.ioa.ac.cn).

NgramLM: Handle cases where interpolated discounting leaves no backoff probability mass.

AdaptiveMarginals: Now handles words that are added after LM was created. This can happen in N-best rescoring and would previously cause an assertion failure.

Fixed bugs in IntervalHeap memory allocation, which could cause problems in N-best generation from lattices (from Jing Zheng).

Set LC_NUMERIC=C in make-big-lm to avoid problems with non-C locales for gawk scripts that compute discounting parameters.

1.5.8 10 May 2009

Functionality:

merge-batch-counts -float-counts option for merging of fractional counts.

compare-sclite now includes statistical significance computation based on a matched-pair Sign test.

Added a Perl tool to compute the cumulative binomial distribution, contributed by Brett Kessler and David Gelbart.
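As an illustration of how a matched-pair sign test relates to the cumulative binomial distribution, here is a minimal Python sketch. It is not the code of compare-sclite or of the Perl tool; the function names are hypothetical, and it assumes that tied pairs have already been discarded and that the test is two-sided.

```python
from math import comb


def cumulative_binomial(n, k, p=0.5):
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def sign_test_p_value(wins_a, wins_b):
    """Two-sided sign test on matched pairs.

    wins_a / wins_b count the pairs on which system A / system B did
    better. Ties are assumed to be excluded before calling (assumption).
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # Under the null hypothesis each system wins a pair with probability 0.5.
    return min(1.0, 2.0 * cumulative_binomial(n, k, 0.5))
```

For example, sign_test_p_value(10, 0) evaluates the chance of a 10-to-0 split (in either direction) under the null hypothesis that both systems are equally likely to win a pair.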

Don't output LM server banner message for ngram -use-server -debug 0.

The LM::generateSentence() function now takes an optional argument to specify a sentence prefix that is to be used to condition subsequent word generation (suggested by Alexy Khrabrov). The default is to condition on <s> as before, or an empty context if no start-of-sentence tag is defined.

A new option ngram -gen-prefixes to read conditioning prefixes from a file, and generate random sentences based on them.

New options in nbest-optimize that modify -print-hyps output so that only unique hypotheses are included (-print-unique-hyps), and to print the original ranks of hypotheses (-print-old-ranks) (from Jing Zheng).

The -version option reports whether support for compressed files is available.

Added merge-batch-counts -l option to control how many files to merge in each iteration.

Bug fixes:

ngram-count, NgramLM: disable the Doug Paul smoothing hack (add one to denominator when smoothing results in 0 backoff mass) in contexts where the entire vocabulary has been observed.

nbest-optimize fixes to the -minimum-bleu-reference functionality (from Jing Zheng).

Fixed nbest-optimize bug that was causing incorrect log output with gcc 4.x.

Output vocabulary index map in binary ngram count and LM format in numerical index order. This avoids a performance bug whereby reading the data structures back into _c binary version could take a long time due to inefficient insertion order.

Fix ngram -counts with -use-server (from Ergun Bicici).

Fixed memory allocation bug in FLM tag vocabulary handling that could lead to crash when interpolating several FLMs.

Rewrote make-batch-counts scripts to (a) avoid problems with limits on command line length, and (b) support systems that don't have compressed file I/O.

Modified merge-batch-counts script to (a) ensure that unmerged files are always merged in the next iteration, to avoid file size imbalance (suggested by Alex Marin), and (b) support systems that don't have compressed file I/O.

Fixed a portability issue with Intel icc version 7.0.

compute-sclite fixed to invoke csrfilt.sh script with -t option.

1.5.9 24 August 2009

Functionality:

Added ngram-count -text-has-weights option to scale counts on a per-sentence basis.
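The idea of per-sentence count scaling can be sketched in a few lines of Python. This is only an illustration, not SRILM code: the function name is hypothetical, and it assumes the weight is the first whitespace-separated field on each input line and that <s> and </s> are added around every sentence.

```python
from collections import Counter


def weighted_ngram_counts(lines, order=2):
    """Accumulate N-gram counts where each sentence contributes its weight.

    Assumption: each line looks like "<weight> w1 w2 ...".
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        weight = float(fields[0])
        tokens = ["<s>"] + fields[1:] + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                # Every N-gram in the sentence is incremented by the weight
                # instead of by 1.
                counts[tuple(tokens[i:i + n])] += weight
    return counts
```

With this sketch, a sentence weighted 2.5 contributes 2.5 to each of its N-grams, so the resulting counts are fractional in general.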

LMStats::countString() and NgramStats::countSentence() methods generalized to take optional weight string argument (to support the above change).

Added compile-time option to generate position-independent code (make MAKE_PIC=yes, see INSTALL file).

Added support for xz-compressed files (.xz files offer better compression than .gz at the expense of time and memory). The xz tool has to be installed separately (http://tukaani.org/xz).

Bug fixes:

wlat-to-pfsg generates NULL output labels for initial/final nodes with sentence start/end tags (because PFSGs encode those implicitly).

TaggedVocab: check and report if number of tags/words exceeds max. Make number of bits allocated for tags/words proportional to word size. Parse word/tag strings such that last (not the first) slash (/) character is treated as the delimiter.
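The last-slash rule matters for words that themselves contain a slash. A minimal Python sketch of the parsing rule (the function name is hypothetical, and this is not the TaggedVocab implementation):

```python
def split_word_tag(token):
    """Split "word/tag" at the LAST slash, so words may contain slashes.

    Returns (word, tag); tag is None when the token has no slash.
    """
    word, sep, tag = token.rpartition("/")
    if not sep:
        return token, None
    return word, tag
```

Splitting at the first slash would instead turn "and/or/CC" into the word "and" with the tag "or/CC".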

Documented the lattice-tool -ngrams-time-tolerance option that had been previously implemented but omitted from the man page.

1.5.10 7 Jan 2010

Functionality:

New option ngram -float-counts to allow the -counts option to process fractional counts.

The LM::pplCountsFile() and LM::countsProb() have been templatized (as a function of count type), and the TextStats class now uses double float counts, all in support of the above change.

New option lattice-tool -word-posteriors-for-sentences for computing word posteriors based on confusion networks (contributed by Jing Zheng).

lattice-tool now performs confusion network decoding and ngram computation AFTER rescoring or expansion with LMs. Therefore the two operations can be combined in a single run where previously two invocations were necessary.

Added fsm-to-pfsg map_epsilon= option, to translate FSM <eps> symbols to another label.

New script filter-event-counts to preprocess a count file for use with ngram -counts .

lattice-tool continues processing when one of the lattices specified with -in-lattice-list cannot be opened.

Regression tests have been moved to module subdirectories (lm/test, flm/test, lattice/test) and can now be run from the top-level with "make test". Decompression of data files for platforms that don't support compressed file I/O is now automatic.

Documentation:

Added new FAQ items covering handling of OOVs and zeroprob words, based on input from Nitin Madnani.

Correction to the man page description of the ngram -count-order option: It limits the maximal order of processed ngrams.

Corrected and updated ordered list of processing steps in lattice-tool man page.

Bug fixes:

Use double precision to record log probs in TextStats object.

Workaround for a deficiency in Intel's 7.00 C++ compiler.

lattice-tool was not handling PFSG lattices in (1best or N-best) decoding with a LM.

lattice-tool will exit with a non-zero status if any of the lattice operations fail.

Fixed some format string/argument mismatches that could bite on 64-bit platforms.

Updated usage of sort with key specification to conform to latest POSIX standard. The old syntax was no longer working with recent GNU sort versions.

1.5.11 16 June 2010

Functionality:

New program "maxalloc" to find the maximum amount of memory that can be allocated by a user process in the current environment. May be useful to debug out-of-memory conditions.

Bug fixes:

Avoid deleting low-posterior null tokens when aligning lattices into word meshes.

Map explicit start/end-of-sentence tags in HTK lattices to null, since they are already implicitly attached to the start/end nodes of the lattice (LM scoring gives anomalous results on repeated tags).

option.[ch]: fixed declaration issues to avoid compiler warnings.

Moved man page for the option library functions to misc/doc.

Bug fixes:

Fixes to compile cleanly with gcc -Wall -Wno-unused-variable -Wno-uninitialized.

Fixed a problem with gcc-4.4 compiles.

Fixed a problem with macro definition of fseeko() and ftello().

Fixed a problem with the lm/ngram-count-wb-subset test, which could fail after the test data is uncompressed.

Use gzip -d to read gzipped files, avoids shell wrapper overhead.

1.5.12 20 Jan 2011

Functionality:

Enable lattice-tool -old-decoding if -nbest-duplicates is specified (and warn about it).

Support make-big-lm -wbdiscount option.

New option ngram -prune-history-lm, for specifying a separate LM that computes the history marginal probabilities needed for N-gram pruning purposes. Inspired by C. Chelba et al., "Study on Interaction Between Entropy Pruning and Kneser-Ney Smoothing", Proc. Interspeech-2010.

Added optional limitVocab argument to VocabMultiMap::read() function. This is now used by lattice-tool -limit-vocab to avoid reading parts of the dictionary that are not used in the input.

Added an option -zeroprob-word to ngram and lattice-tool. It specifies a word that should be used as a replacement if the current word has probability zero. This is different from -map-unk which only applies to OOV words and actually replaces the word label in the output lattice, if any.

Added new wrapper LM class NonzeroLM, to implement the above.

Portability:

New MACHINE_TYPE values for Android-ARM platform: android-armeabi and android-armeabi-v7a (from Mike Frandsen).

Deleted the htk directory from distribution; it was obsolete and not documented.

Bug fixes:

Prob.h: guard against under/overflow in intlog and bytelog conversions.

Replaced gunzip with gzip -d in all scripts (for efficiency).

Better option checking in make-big-lm, disallowing mixing of discounting methods and use of discounting flags that are not supported.

Undefine max() macro in Trellis.h to avoid conflict with some system header files.

Better support for recent MSVC versions in common/Makefile.machine.msvc (from Mike Frandsen).

add-pauses-to-pfsg: prevent existing pause nodes from being processed.

1.6.0 8 December 2011

Functionality:

Added lattice-tool -loglinear-mix option.

Add platform-independent strtok_r() function, and replaced all instances of strtok(). Eventual goal is thread safety and re-entrance.

Modified File object to allow I/O to/from strings as well as files.

Modified code for reading and writing HTK lattices and NBest lists to enable I/O to/from strings as well as files, for in-memory processing.

Added special-purpose malloc/free implementation for SArray and LHash data structures, to reduce overhead for small allocation chunks. Also added some allocation statistics reporting (enabled by ngram -memuse -debug 1).

Added the metadb config file lookup tool.

Cumulative binomial script (cumbin) command accepts optional 3rd argument to set p parameter.

Bug fixes:

Correctly handle lattice-tool -use-server when generating nbest lists (server-based LM was previously ignored).

lattice-tool -split-multiwords no longer splits words appearing in -ignore-vocab.

lattice-tool allowed to operate on HTK lattices containing unrecognized header fields (but warn about them).

Updated reference output for many build platforms to avoid spurious test failures.

Avoid abnormal backoff weights when lower-order probabilities sum to almost one.

Avoid test failures for merge-batch-counts and make-ngram-pfsg due to locale differences.

Fix maxalloc for 64bit systems where "long" is still 32 bits.

Building:

Added Microsoft Visual Studio 2005 projects, see doc/README.windows-msvc-visual-studio for more information.

Added new Makefile targets superclean and pristine to return SRILM to pre-build state.

Add Makefiles for MACHINE_TYPE macosx-m32 and macosx-m64 to allow explicit 32- or 64-bit compilation on MacOS X 10.6. Updated GAWK location to allow tests to succeed.

Replaced various C-shell helper scripts in sbin/ with Bourne-shell versions, for greater portability.

New MACHINE_TYPE=msvc64 for 64bit builds with Visual Studio.

Documentation:

Added doc/asru2011-srilm.pdf, a paper describing SRILM updates since 2002.

Old ICSLP paper renamed to doc/icslp2002-srilm.pdf .

1.7.0 23 December 2012

Functionality:

ngram -codebook option for reading of Ngram LMs with quantized parameters (contributed by Microsoft).

ngram -msweb-lm option for obtaining LM probabilities from the Microsoft Web N-gram service (web-ngram.research.microsoft.com). You need to obtain a user ID to use this service, see man ngram for details (contributed by Microsoft).

Added support for dictionary-induced word distance metrics to nbest-optimize (-dictionary option).

Added support for matrix-defined word distance metrics to nbest-optimize (-distances option).

ngram -debug 4 -ppl outputs ranking statistics (number of times correct word was in top 1, 5, 10), as well as quadratic and absolute loss averages (based on code from Omid Madani).

nbest-optimize accepts n-best list in SRInterp format and generates SRInterp format rover-control file (weights file), when -srinterp-format is specified.

nbest-optimize accepts SRInterp counts file that contains BLEU and TER counts info.

lattice-tool -read-mesh will try to preserve acoustic information (times, scores, pronunciations) if they are encoded in the input confusion network.

Support reading of text files in UTF-8 and UTF-16 encodings. All string data is internally represented, and output, as ASCII/UTF-8 (contributed by Microsoft). This feature uses the iconv library. Support for this feature can be disabled by compiling with "NO_ICONV=anything" on the make command line.

Portability:

Ported LM client/server code to Winsock API (native socket library in Windows), enabling this functionality for mingw and MSVC platforms (contributed by Microsoft).

Let machine-type script return 64bit platform names for Linux and Solaris x86 when appropriate. This implies that 64bit binaries are built by default on machines that support them.

Array.h tweak for clang compiler (from kutlak.roman@gmail.com).

Work around a namespace problem in C++11 (from kutlak.roman@gmail.com).

Use size_t for hash codes to ensure word width matches pointer type.

Fixes for mingw32 build, using Windows APIs for sockets and UTF conversion (contributed by Microsoft).

Support for 64bit mingw build (MACHINE_TYPE=win64).

Updates for MacOSX (MACHINE_TYPE=macosx, thanks to Chuck Wooters).

Deal with nonportability of isfinite() and isnan().

Changes for thread-safety (by Kyle McIntyre). See doc/README-THREADS for details.
- Modified the remove() methods in various container classes to return Boolean instead of a pointer to the removed element. The removed element can be gotten with an optional reference argument. This eliminates the need for a global static variable.
- Use STL sort() instead of qsort() in LHash and SArray sorted iterations.
- Replaced all static variables with thread-local storage via the TLSWrapper class, requiring the pthread library. This is available on most platforms, but can be disabled at compile-time with -DNO_TLS.

Bug fixes:

NgramLM backoff computation fixed to avoid spurious insertion of nonzero unigram probabilities and non-unity backoff weights (resulting from numerator/denominator values below Prob_Epsilon).

lattice-tool does a better job inferring the lattice basename from the UTTERANCE string embedded in HTK lattices.

Trellis class: use a secondary sorting criterion to make N-best output deterministic.

WordMesh class: use posterior word probability to decide which acoustic information to keep when merging hyps, instead of duration-normalized acoustic scores as before. This leads to fewer words with out-of-order timestamps when extracting one-best from confusion networks.

fix-ctm script: Check for out-of-order word timestamps and adjust them minimally as needed to produce a monotonic sequence, as required for CTM sorting.

Fixed bug in NgramCountLM estimation procedure reported by ariya@jhu.edu.

Allow ngram -hidden-vocab to read hidden event properties described in man page.

Fixed bug in ngram -hidden-vocab -write-lm output.

Avoid crash when ngram -hidden-not -ppl is used with debug level 2.

Fixed (very rare) bug by which ngram -prune might remove all ngrams sharing a common context.

Improved ngram -prune-lowprobs by also removing backoff weights that have become useless (suggested by Arlo Faria).

Check for successful search for HTK lattice start/end nodes, if not explicitly specified (reported by nshmyrev@yandex.ru).

Handle infinity scores in lattice rescoring, and catch NaN scores when reading HTK lattices.

make-kn-discounts checks for negative discount values and reports error if appropriate.

nbest-optimize accepts combined BLEU and error rate objective via switch -error-bleu-ratio R (R specifies the error rate weight).

lattice-tool -timeout option now uses sigsetjmp/siglongjmp to handle timeout alarms. This is necessary in Linux-compatible (including cygwin) systems to handle alarms repeatedly.

Fixed a bug reading NBestList2.0 format without phone information (led to malformed confusion network output).

Fixed a bug in Ngram::contextID() that was causing incorrect expansion of lattices with pruned backoff models.

Fixed a bug in the lattice-tool -keep-unk implementation that was sometimes allowing an OOV word label to be output as <unk>.

Removed some pseudo-randomness in ngram-class so that results are more invariant to OPTION setting and platform properties.

Avoid differences due to machine arithmetic in word mesh alignment, making confusion network building and posterior decoding more stable across platforms.

Exclude metatags when writing out the vocabulary of binary Ngram LMs.

Fixed some missing dependencies in Visual Studio solution file.
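The fix-ctm entry in the list above mentions adjusting out-of-order word timestamps into a monotonic sequence. One simple way to do that is a single forward pass that raises any start time falling below the latest one seen so far. The Python sketch below shows only this idea; the function name is hypothetical, and the actual adjustment rule used by the fix-ctm script is not described in this file and may differ.

```python
def make_monotonic(start_times):
    """Return a non-decreasing copy of start_times.

    Each out-of-order time is raised to the latest time seen so far;
    times that are already in order are left unchanged.
    """
    fixed = []
    latest = float("-inf")
    for t in start_times:
        latest = max(latest, t)
        fixed.append(latest)
    return fixed
```

Only the offending entries are changed, so an already monotonic sequence passes through untouched.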

1.7.1 4 June 2014

Updated INSTALL, Copyright. Added ACKNOWLEDGEMENTS.

Functionality:

Integrated the maximum entropy extension by Tanel Alumae, described at http://www.phon.ioc.ee/~tanela/srilm-me/ . Please cite Tanel's paper (copied here in doc/is2010-maxent.pdf) if you use this functionality in your research.

Enable LM server to process multiple commands in a single message (separated by newlines). This capability was never documented, but existed in the first implementation that used read/write system calls, but was lost when we switched to recv/send calls.

Generalized the BayesMix LM class to allow an arbitrary number of mixture components, similar to LoglinearMix.

Added the ngram -context-priors option to read context-dependent mixture weight priors from a file.

Added the ngram -read-mix-lms option to read the list of interpolated LMs, weights and options from a file, specified by the -lm option.

Use zlib for I/O from/to gzipped files. Benefits are: (a) works with native Windows binaries, (b) avoids subprocess, (c) allows reading (though still not writing) of gzipped binary LM and count files.

ngram-count -gtNmin options accept floating point values for more flexibility with LM estimation from fractional counts.

Added lattice-tool -set-lattice-names option to preserve input filenames inside lattices.

New script replace-unk-words, for replacing OOV words relative to a vocabulary with <unk> tag.

Added new lattice-tool options -hyp-list -hyp-file -hyp2-list -hyp2-file -add-hyps to add ASR hypotheses into word mesh (confusion network). The added options are similar to -ref-list -ref-file -add-refs, except that the added hypothesized words will not be indicated as reference words in the word mesh.

Added a function in WordMesh to compute slot-to-slot alignment between two confusion networks.

Added ngram-class option to limit number of words per class (from seppo.enarvi@aalto.fi).

Portability:

Added support for 64bit cygwin builds (MACHINE_TYPE=cygwin64).

Bug fixes:

ngram -rescore-ngram was not setting the handling of special word tokens (<s>, </s>) if the rescored LM was being evaluated in the same run.

ngram-count -skip needs to read counts one order higher than specified by -order .

SkipNgram will now try to reestimate the discounting parameters from expected counts on each EM iteration (but fall back on initial parameters if that fails, e.g., for discounting methods that cannot handle float counts).

SubVocab instances' handling of metatags and nonevent words is now tied to the base Vocab instance.

Avoid anomalies in random word generation due to nonzero probabilities for nonwords.

Cleaned-up select-vocab script from Anand Venkataraman. Now works with perl 5.12 and gives consistent results on different platforms. Added a test case.

Fixed removeTrie() bug that was leading to memory leak in Ngram destructor.

Fixed bug in LHash iterator that led to potential double enumeration of items after deletions, and could affect Ngram pruning results.

Allow number of ngrams in ARPA LM to exceed 2^31. (Vocabulary size is still limited to 2^32.)

Initialize key and data objects in SArray and LHash containers after allocation.

Pass Trellis state parameters by reference to avoid copying of potentially complex objects.

Fixed memory access error in Ngram::clear() for order-1 models.

Fixed a problem handling null string states in Trellis.

Fix to preserve double precision in NBest acoustic and LM scores.

Fixed an error concerning the use of -gtNmin options in the srilm-faq(7) man page pointed out by dugast@systran.fr.

If a lattice-tool input lattice is a word mesh, avoid calling alignLattice() since the input is already a word mesh.

Fixes to reading/writing of quantization codebook files.

Fixed header comment and test program for Map2::remove().

1.7.2 9 November 2016

Functionality:

Added interfaces to Lattice and WordMesh that allow external programs to map sausage nodes to their original lattice nodes.

New VocabDistance subclass StemDistance, comparing words only based on their stems.

New lattice-tool option -stem-dist triggers StemDistance use in confusion network alignments, including -add-hyps and -add-refs processing.

Add optional support for keyword spotting (in Lattice.h and LatticeIndex.cc) when writing a 1-gram index.

Added new File field NBestOptions::nbestRttm2; if it exists, write (an approximation to) the NBestList2.0 format output.

Added simple Trellis pruning based on relative thresholding of forward probabilities (Trellis::prune()).

make-big-lm now understands the -ukndiscount option. The make-kn-discounts helper script has an option to compute unmodified KN discounts.

The -version option now reports the compiler version used.

Added ngram-count -write-text option to test conversion of UTF-16 files to ASCII/UTF-8.

Added ngram -text-has-weights option to allow weighting sentences in ppl computation.

Added scripts nbest-words and compute-sclite-nbest for conveniently computing nbest-optimize -errors information using sclite.

Added the nbest-optimize -xval-files option to support cross-validation.

Added script search-rover-combo for searching for best combination among a list of systems.

Added confidence value fields to NBestWordInfo class.

Added check to compute-best-mix to warn about word label mismatches between input files.

Portability:

Honor TMPDIR environment variable in various scripts.

Miscellaneous MacOS X fixes.

Include BSD rand48 functions so that random sentence generation gives same result on all platforms.
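Bundling the generator makes the random sequence a pure function of the seed, independent of the platform's C library. As an illustration, here is a minimal Python sketch of the standard 48-bit linear congruential recurrence behind srand48()/drand48(). It is not the code shipped with SRILM, and the class name is hypothetical.

```python
class Rand48:
    """Sketch of the drand48() family: X' = (A * X + C) mod 2**48."""

    A = 0x5DEECE66D          # standard rand48 multiplier
    C = 0xB                  # standard rand48 addend
    MASK = (1 << 48) - 1     # keep the low 48 bits

    def __init__(self, seed=0):
        self.srand48(seed)

    def srand48(self, seed):
        # High 32 bits of the state come from the seed; the low 16 bits
        # are set to the conventional constant 0x330E.
        self.state = ((seed & 0xFFFFFFFF) << 16) | 0x330E

    def drand48(self):
        # Advance the state and scale it to a float in [0, 1).
        self.state = (self.A * self.state + self.C) & self.MASK
        return self.state / float(1 << 48)
```

Two generators created with the same seed produce identical sequences on any platform, since only integer arithmetic on a fixed 48-bit state is involved.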

Bug fixes:

Avoid leaky backoff by mapping very small probability sums to 0 in BOW computation. Otherwise unseen ngrams may end up with nonzero probabilities in unsmoothed LMs.

Fixed compare-ppls, compute-best-mix, compute-best-sentence-mix, and ppl-from-log to recognize the MSVC representation of -infinity.

Fixed a bug in the handling of zero prefix probabilities in ClassNgram, HiddenNgram and HMMofNgrams.

Fixed a memory allocation bug that caused the ngram-count-maxent test to crash.

Fixes to lattice-tool rttm nbest output.

Fix for possible endless loop in lattice-tool -posterior-prune due to limited float precision (from Seppo Enarvi).

Fixed a problem with the declarations of Map_nokeyP() that take reference arguments and were missing "const"; this was causing a crash in the segment tool.

Workaround for what looks like an optimizer bug in gcc >= 4.9 that can cause ngram -prune to core dump.

Output TextStats quantities (sentence/word counts, log probs, perplexities), model parameters, nbest and lattices scores, and other quantities with full precision so as to avoid loss of information.

nbest-optimize -1best now outputs a rover-control file that simulates Viterbi decoding (by using a small posterior scale).

nbest-optimize -errors now tolerates varying number of reference words for the same sentence. This can arise from sclite references with alternate word strings.

Fixed a stupid bug in uniform-classes.gawk script.

Allow combine-rover-controls to merge control files with the same systems in them, adding their weights.

Updated zlib to version 1.2.8. This fixes a bug whereby gzipped output files could end up with zero size (instead of a legal gzipped file that results in a zero-length file when decompressed).

1.7.3 9 September 2019

Functionality:

Added nbest-oov-counts script to generate OOV counts for nbest hypotheses.

Added a simple mechanism for weight tying in nbest-rover control files. A system weight of = indicates that it should be tied to the previously listed system. This is useful for reducing the number of free parameters when searching for good system combinations (search-rover-combo).

Add Map_noKey() and Map_noKeyP() for unsigned long long type, to enable use with size_t on Windows MSVC.

Output from -version now includes compile-time options.

Added option ngram -minbackoff to fix up models that have unnormalized probabilities or that are not smoothed.

Added option ngram -unk-probs to override unknown word probabilities.

Added nbest-optimize-args-from-rover-control script, convenient for extracting initialization parameters for nbest-optimize from existing nbest-rover control file.

Added ngram-count -text-has-weights-last option to allow text input with count values at ends of lines.

Added nbest-rover -missing-nbest option to treat missing nbest lists as if an empty hypothesis (no words) had been output, rather than simply skipping that nbest list.

Added nbest-lattice -time-penalty option, implementing a soft constraint on time stamps (when present) during confusion network building and alignment.

Added nbest-lattice -average-times option, to average word times instead of picking the timing of the highest posterior hypothesis.

Added nbest-lattice -suppress-vocab option to disallow certain words in posterior decoding.

New script concat-sausages for chaining word confusion networks together.

Added nbest-lattice -dump-lattice-alignments option to output mappings between sausage positions and alignment costs.

Updated Android build for 64-bit development for armv8 using NDK r20 and clang. This almost certainly breaks the 32-bit build for armv7. The last known good 32-bit build is in common/Makefile.core.android.r11c, last built using NDK r11c. To use this, copy Makefile.core.android.r11c to Makefile.core.android. See doc/README.android.

Bug fixes:

Added a new tool nbest-rover-helper that combines the functions of the combine-acoustic-scores and nbest-posteriors scripts, doing these computations in double precision and faster. nbest-rover now uses this tool (except when certain options like -nbest-backtrace are used).

nbest-rover strips DOS end-of-line CR characters from the control file, so they no longer mess up the parsing of the file.

Rationalize the way ties are broken when decoding word confusion networks. The word with the lowest internal index is now preferred (and the DELETE token always comes before all other words), unless the new nbest-lattice option -random-tie-break is given. The output order of alternative word hypotheses to sausage files is always by probability rank first, then by internal index.

The reverse-ngram-counts script now replaces <s> with </s> and vice-versa, as required for training reverse-direction LMs, and consistent with reverse-text.

Handle comment lines starting with '##' and empty lines in nbest-rover control files the same way as in File::getline(), i.e., ignore them.

Fixed the syntax for the nbest-optimize -dynamic-random-series options (now starts with single dash, as described in man page).

Don't let compute-best-mix complain about word mismatches if <unk> is involved.

Cast input to isspace() to (unsigned char) to guarantee input is non-negative.

Fixed memory management problems in MEModel.

Work around a bug in zlib's gzprintf() printing of very long %s arguments; was causing long word strings not to be output into .gz files.

Removed word string length limit.

Removed limit on total line length in outputting ngram count files.

Zlib updated to version 1.2.11.

nbest-posteriors ensures that bytelog scores are output in fixed-point format.

Allow floating point values when parsing bytelog scores in nbest lists.

More robustness to word sausage input files that have missing data for some positions.

Fixed a performance bug when nbest-rover is invoked with -output-ctm option.
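The deterministic tie-breaking rule described in the list above (probability first, then DELETE before other words, then lowest internal index) can be expressed as a sort key. The Python sketch below only illustrates that ordering; it is not the nbest-lattice implementation. The function names, the "*DELETE*" label, and the index_of mapping from words to internal vocabulary indices are all assumptions made for the example.

```python
DELETE = "*DELETE*"  # assumed label for the deletion token


def order_hypotheses(slot, index_of):
    """Order the (word, posterior) pairs of one confusion-network slot.

    Sort by descending posterior; break ties by placing DELETE first and
    then by ascending internal word index.
    """
    def key(entry):
        word, posterior = entry
        tie_rank = -1 if word == DELETE else index_of[word]
        return (-posterior, tie_rank)

    return sorted(slot.items(), key=key)


def decode_slot(slot, index_of):
    """Pick the top hypothesis for a slot under the ordering above."""
    return order_hypotheses(slot, index_of)[0][0]
```

Because the key depends only on posteriors and indices, repeated runs give the same output, which is the point of replacing arbitrary tie-breaking with a fixed rule.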

$Date: 2019/09/09 23:09:32 $
