The CMU-Cambridge Statistical Language Modeling Toolkit

History of changes

V2.00 June 17 1997

Original Version

V2.01 July 1 1997

Corrected "Usage" information in idngram2lm.

Added percentage counts to n-gram hits chart in evallm.

Improved the documentation slightly.

No longer refer to back-off weights as "alphas" in idngram2lm. So command line options like "-two_byte_alphas" and "-max_alpha" become "-two_byte_bo_weights" and "-max_bo_weight". The old forms are still supported, in order to provide consistency with V2.00.

Tools now terminate in the event of unrecognised command-line arguments (previously they simply displayed a warning).

Fixed bug in endian.sh.

V2.02 July 11 1997

Fixed bug in mergeidngram, so that it now writes binary files correctly.

Corrected documentation of the -disc_ranges flag of idngram2lm.

Fixed -calc_mem option in idngram2lm.

Fixed bug in idngram2lm which sometimes cause segmentation faults when trying to read .gzipped files as if they were uncompressed.

Fixed bug in idngram2lm which caused major problems when constructing closed vocabulary models from idngram streams with OOVs in them. The correct behaviour (which occurs now) is for a warning to be displayed, and any n-gram in the input stream with an OOV in it to be igrnored.

Fixed bug in ngram2mgram which was causing it to handle the first and last lines of its input incorrectly.

Fixed bug in wngram2idngram which caused a segmentation fault for unigrams.

Fixed bug in the write_arpa function of write_lms.c so that we don't now have it trying to display a back-off weight for unigrams if we are only dealing with a unigram model.

Fixed bugs in idngram2lm which caused problems with the creation of unigram models.

Fixed bugs in evallm which caused problems with reading ARPA format unigram models.

V2.03 Nov 10 1997

Fixed bug in wngram2idgram which caused problems if first OOV buffer became full.

Changed temporary file names in text2idngram, text2wngram and wngram2idgram such that they now contain the hostname and process id, to avoid clashes.

Fixed bug in interpolate, so that -probs option can now be used with -cv

Fixed bug in evallm which caused problems reading ARPA format LMs when used with 4-grams, 5-grams, etc. and cutoffs of > 0. (NOTE: This problem did not affect the *writing* of ARPA format LMs)

Fixed bug in idngram2lm which caused problems if first n-gram fell below cutoff threshold.

Changed behaviour of evallm such that P(A | B C) = 1 doesn't generate a warning anymore.

V2.04 31 March 1998

Fixed error in "Typical Usage" section of the documentation.

Fixed bug when closing the file when reading a binary format language model in load_lm.c.

Changed #define VERSION 2.?? to #define SLM_VERSION 2.?? in order to prevent clashes when SLM2.h is included in non-toolkit programs.

Fixed bugs whereby not enough memory was allocated when reading a binary format language model in load_lm.c.

Fixed bug in text2idngram and wngram2idgram which caused core dump if a word of more than 500 charactors was present.

V2.05 7 June 1999

Fixed bugs concerned with reading ARPA format language models in load_lm.c.

Philip Clarkson - prc14@eng.cam.ac.uk