Raymond Lau

Excerpt from January '95 SLS Annual Research Summary

Adaptive Statistical Language Modelling

Language models are useful in speech recognition and machine translation systems because they place significant constraints on the search space of possible word sequences. A very successful and widely used language model is the trigram statistical language model [1], which provides a probability for the next word given the previous two words. This probability distribution is derived from the distribution observed over some training text. A significant shortcoming of the trigram model is that it considers only the previous two words in determining the probability of the next word. As such, it can be considered a static model: it cannot adjust to anything that occurred earlier in the text. In this work [2], we consider extending the trigram model so that it adapts to longer distance information such as the current topic of discourse. For our work, we use the Wall Street Journal (WSJ) corpus, a 39 million word collection of Wall Street Journal articles from 1987 through 1989. An important advantage of the WSJ corpus is that the data is divided into individual articles. We can consider each article to be an independent document, and we take documents to be the unit of adaptation for our language modelling efforts.
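As a minimal illustration of what a trigram model computes, the sketch below estimates the probability of the next word from raw counts over a toy token sequence. It is an unsmoothed maximum likelihood estimate, not the deleted interpolation model used as the baseline in this work; the function names and the toy sentence are assumptions made for the example.

```python
from collections import Counter

def train_trigram(tokens):
    """Count trigrams and the two-word contexts that precede a third word."""
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    # Pair each token with its successor, stopping before the final bigram,
    # so that every counted context is followed by at least one word.
    context_counts = Counter(zip(tokens, tokens[1:-1]))
    return trigram_counts, context_counts

def trigram_prob(w2, w1, w, trigram_counts, context_counts):
    """Unsmoothed estimate of P(w | w2, w1); 0.0 for an unseen context."""
    context = context_counts[(w2, w1)]
    if context == 0:
        return 0.0
    return trigram_counts[(w2, w1, w)] / context

# Toy data, invented for illustration only.
tokens = "the stock market fell as the stock market closed".split()
tri, ctx = train_trigram(tokens)
p_fell = trigram_prob("stock", "market", "fell", tri, ctx)
```

Whatever appeared before the two context words has no effect on the estimate, which is the limitation that adaptation is meant to address.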

The primary information element we will add to the trigram model is the trigger pair. A trigger pair A->B is a pair of words, A and B, such that having the trigger, A, in the document history affects the subsequent probability of the word B. By the document history, we are referring to the sequence of words from the beginning of the document up to the word immediately preceding the word whose probability our language model is asked to score. For example, in the Wall Street Journal corpus, the presence of the word "brokers" in the document history increases the probability of subsequently seeing the word "stock" by approximately a factor of five.
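The effect of a trigger can be made concrete by comparing how often B is the next word when A is in the document history with how often B is the next word overall. The sketch below computes that ratio on toy documents; the function name `trigger_lift` and the data are hypothetical, and this is only one plausible way to measure the kind of factor quoted above.

```python
def trigger_lift(documents, trigger, target):
    """Ratio P(next word = target | trigger in history) / P(next word = target).

    Each document is a list of words. The history of a position holds only
    the words strictly before it in the same document.
    """
    total = hits = 0            # all word positions
    total_with = hits_with = 0  # positions whose history contains the trigger
    for doc in documents:
        seen = False
        for word in doc:
            is_target = (word == target)
            total += 1
            hits += is_target
            if seen:
                total_with += 1
                hits_with += is_target
            # Update the history only after scoring the current position.
            if word == trigger:
                seen = True
    if hits == 0 or total_with == 0:
        return None  # the ratio is undefined without any observations
    return (hits_with / total_with) / (hits / total)

# Toy documents, invented for illustration only.
docs = [
    ["brokers", "sold", "stock", "and", "stock"],
    ["the", "weather", "was", "fine", "today"],
]
lift = trigger_lift(docs, "brokers", "stock")
```

A ratio above 1 means the trigger raises the probability of the target word, and a ratio below 1 means it lowers it.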

Combining trigger pairs with the information provided by the trigram model into one model poses two main problems. First, we have to decide which trigger pairs to include. We cannot hope to include every possible word pair as a trigger pair because such a model would be virtually impossible to train given any reasonably sized training corpus. Furthermore, such a model would be computationally intractable to employ. The second major problem is how to combine the information from the trigger pairs with the information from the trigram model.

To select our trigger pairs, we rely on the average mutual information between the words in the pair. We select the top seven triggers for every given word as ranked by this criterion. We obtain what we feel is a reasonable set of triggers using this method. For example, for the word "political," we arrive at the following triggers: "political," "party," "presidential," "politics," "election," "president," "campaign."
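The selection step can be sketched as follows, assuming that each candidate pair is summarized by a 2x2 table of counts over word positions: whether the trigger is in the document history, and whether the word being predicted is the target. That binary-event formulation, the function names, and the `k=7` default mirroring the text are assumptions of this sketch rather than a description of the exact procedure used.

```python
import math

def average_mutual_information(n_ab, n_a_notb, n_nota_b, n_nota_notb):
    """Mutual information (in bits) between two binary events from counts.

    n_ab        : trigger in history, next word is the target
    n_a_notb    : trigger in history, next word is not the target
    n_nota_b    : trigger not in history, next word is the target
    n_nota_notb : trigger not in history, next word is not the target
    """
    total = n_ab + n_a_notb + n_nota_b + n_nota_notb
    row_a, row_nota = n_ab + n_a_notb, n_nota_b + n_nota_notb
    col_b, col_notb = n_ab + n_nota_b, n_a_notb + n_nota_notb
    cells = [
        (n_ab, row_a, col_b),
        (n_a_notb, row_a, col_notb),
        (n_nota_b, row_nota, col_b),
        (n_nota_notb, row_nota, col_notb),
    ]
    mi = 0.0
    for joint, row, col in cells:
        if joint > 0:  # an empty cell contributes nothing to the sum
            mi += (joint / total) * math.log2(joint * total / (row * col))
    return mi

def top_triggers(ami_by_trigger, k=7):
    """Return the k candidate triggers with the highest score for one word."""
    return sorted(ami_by_trigger, key=ami_by_trigger.get, reverse=True)[:k]
```

Independent events score zero under this measure, so a trigger is ranked highly only when its presence or absence changes the distribution of the target word.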

To combine information from the trigger pairs and from the trigram model, we employ a maximum entropy framework. In this framework, the information elements are modeled as constraints. Constraints are defined via constraint functions on the document history and the word being predicted. We require that the average weight or vote the model gives to each constraint function match the average number of times the constraint makes a correct prediction in the training data. In general, many distributions satisfy this requirement. Of those, we pick the one with maximum entropy.

In our case, we have trigger constraints of the form c(t, w) = 1 if trigger t is in the document history and word w is the word being predicted, and 0 otherwise. To incorporate the trigram model, we have trigram constraints of the form c(w_2, w_1, w) = 1 if the last two words in the document history are w_2, w_1 and the word being predicted is w, and 0 otherwise. To train our model, we use the method of generalized iterative scaling, which is guaranteed to converge to the unique solution.
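To make the training procedure concrete, the following is a small sketch of generalized iterative scaling for a conditional model with one trigger constraint and one trigram constraint. The toy events, the two-word vocabulary, the slack feature used to make the feature sums constant, and all names are assumptions of the sketch; it is not the training code used in this work and ignores every efficiency concern that matters at realistic scale.

```python
import math

def train_gis(events, vocab, features, iterations=200):
    """Fit constraint weights by generalized iterative scaling (GIS).

    events   : list of (history, word) pairs from training data
    features : list of 0/1 functions f(history, word)
    Assumes at least one feature fires somewhere, so that C > 0.
    Returns a function mapping a history to {word: probability}.
    """
    # GIS needs the features to sum to the same constant C everywhere,
    # so a slack feature pads every feature vector up to C.
    C = max(sum(f(h, w) for f in features) for h, _w in events for w in vocab)

    def padded(history, word):
        values = [f(history, word) for f in features]
        return values + [C - sum(values)]

    n = len(features) + 1
    empirical = [0.0] * n
    for h, w in events:
        for i, v in enumerate(padded(h, w)):
            empirical[i] += v / len(events)

    weights = [0.0] * n

    def distribution(history):
        scores = [
            math.exp(sum(l * v for l, v in zip(weights, padded(history, w))))
            for w in vocab
        ]
        z = sum(scores)
        return [s / z for s in scores]

    for _ in range(iterations):
        expected = [0.0] * n
        for h, _w in events:
            for w, p in zip(vocab, distribution(h)):
                for i, v in enumerate(padded(h, w)):
                    expected[i] += p * v / len(events)
        for i in range(n):
            # Constraints never observed in training are left at weight 0.
            if empirical[i] > 0 and expected[i] > 0:
                weights[i] += math.log(empirical[i] / expected[i]) / C

    return lambda history: dict(zip(vocab, distribution(history)))

# Toy constraints and events, invented for illustration only.
def trigger_c(history, word):
    return 1 if "brokers" in history and word == "stock" else 0

def trigram_c(history, word):
    return 1 if history[-2:] == ("saw", "the") and word == "rain" else 0

events = [
    (("brokers", "sold", "the"), "stock"),
    (("brokers", "bought", "the"), "stock"),
    (("brokers", "like", "the"), "rain"),
    (("we", "saw", "the"), "rain"),
    (("we", "saw", "the"), "stock"),
    (("we", "like", "the"), "rain"),
]
predict = train_gis(events, ["stock", "rain"], [trigger_c, trigram_c])
p_stock = predict(("brokers", "sold", "the"))["stock"]
```

In this toy data, "stock" follows two of the three histories that contain "brokers", and the fitted model assigns it a probability close to 2/3 for such histories, which is what matching the trigger constraint requires.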

We trained our model on five million words of Wall Street Journal text and tested it on a test set of 870 thousand words. With a vocabulary of 20 thousand words, we discovered 98 thousand trigger pairs which occurred frequently enough in the training data to be included in our model. We achieved an overall reduction of 22% in the perplexity of the test data as compared to a deleted interpolation trigram model. A more extensive discussion of this work can be found in [3] and [4].
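For reference, perplexity is the exponential of the average negative log probability that a model assigns to each word of the test text, so a lower value is better. The sketch below computes it and the relative reduction between two models; the probability lists are toy values invented for the example and are unrelated to the results reported above.

```python
import math

def perplexity(word_probs):
    """Perplexity from the probability assigned to each test word in turn."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))

def relative_reduction(baseline, adapted):
    """Fractional decrease in perplexity relative to the baseline model."""
    return (baseline - adapted) / baseline

# Toy per-word probabilities, invented for illustration only.
baseline_probs = [0.10, 0.20, 0.05, 0.10]
adapted_probs = [0.10, 0.40, 0.10, 0.10]
reduction = relative_reduction(
    perplexity(baseline_probs), perplexity(adapted_probs)
)
```

A model that assigns probability 1/4 to every test word has a perplexity of 4, which is why perplexity is often read as an effective branching factor.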


[1] Bahl, L., Jelinek, F., and Mercer, R., "A Maximum Likelihood Approach to Continuous Speech Recognition," in IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(2): 179-190, 1983.

[2] Lau, R., Adaptive Statistical Language Modelling, Master's thesis, Massachusetts Institute of Technology, May 1994.

[3] Lau, R., Rosenfeld, R., and Roukos, S., "Trigger-based Language Models: A Maximum Entropy Approach," in Proceedings ICASSP '93, Minneapolis, April 1993.

[4] Lau, R., Rosenfeld, R., and Roukos, S., "Adaptive Language Modelling Using the Maximum Entropy Approach," in Proc. ARPA Human Language Technologies Workshop, March 1993.