Raymond Lau

Excerpt from January '96 SLS Annual Research Summary

Sublexical Modelling for Speech Recognition with Angie

In the present research, we attempt to build a speech recognition system that explicitly models sublexical linguistic phenomena such as syllabic and morphological structure. We hope that such a system can provide a single unified probabilistic framework for modelling phonological variation and morphology. Many current systems handle phonological variation either with a pronunciation graph (as in MIT's SUMMIT system) or by implicitly absorbing the variations into a hidden Markov model. The former has the disadvantage of not sharing common subword structure, hence splitting the training data. The latter masks the process of handling phonological variations, making that process difficult to control and to improve upon. For example, in the ATIS domain, the words "fly," "flying," "flight," and "flights" all share the common initial phoneme sequence "f l ay," so presumably phonological variations affecting this sequence can be better learned if examples from all four words are pooled together. Our system does just that.

The sharing of subword structure will hopefully facilitate the search process and also make it easier to deal with new, out-of-vocabulary words. By pursuing merged common subword theories during the search, we can mitigate the combinatorial explosion of the search tree, making large-vocabulary recognition more manageable. Because we expect new words to share much common subword structure with words in our vocabulary, we can easily add new words dynamically, allowing them to adopt existing subword structures. In principle, we can even detect the occurrence of out-of-vocabulary words by recognizing as much of the subword structure as we can in a bottom-up manner.
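As a toy illustration of this kind of prefix sharing (a sketch only, not Angie's actual data structures), the four ATIS words above can be stored in a phoneme trie, so that training examples for the shared "f l ay" prefix are pooled at a single node:

```python
# Toy sketch of pooling shared phoneme prefixes in a trie.
# The TrieNode class and lexicon below are invented for illustration.

class TrieNode:
    def __init__(self):
        self.children = {}   # phoneme -> TrieNode
        self.words = []      # words whose pronunciation ends at this node
        self.count = 0       # pooled training examples passing through

def add_pronunciation(root, phonemes, word):
    """Insert one word's phoneme sequence, reusing existing prefixes."""
    node = root
    for ph in phonemes:
        node = node.children.setdefault(ph, TrieNode())
        node.count += 1
    node.words.append(word)

root = TrieNode()
lexicon = {
    "fly":     ["f", "l", "ay"],
    "flying":  ["f", "l", "ay", "ih", "ng"],
    "flight":  ["f", "l", "ay", "t"],
    "flights": ["f", "l", "ay", "t", "s"],
}
for word, phones in lexicon.items():
    add_pronunciation(root, phones, word)

# The "ay" node reached via "f l" is shared by all four words, so its
# training count pools all four words' examples.
shared = root.children["f"].children["l"].children["ay"]
print(shared.count)  # 4
```

Because every path through the shared prefix increments the same node, evidence from all four words accumulates in one place, which is the intuition behind pooling phonological training data across the lexicon.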

To model constraints at the subword level, we are using a framework which we have named Angie. Angie is a layered system very similar to that proposed by Meng [1]. In Angie, we have six layers: a top-level sentence layer, a word layer, a morphology and stress layer, a syllable layer, a phoneme layer, and a phone layer. Adjacent layers are governed by context-free rules and probabilities. Angie assigns a probability to each internal node of a parse tree conditioned upon the node's child and left sibling, and to each bottom leaf node conditioned upon the node's left column, its left sibling, and that sibling's ancestors. An example of an Angie parse, along with each column's score, is shown in Figure 1. In our recognizer, we perform a bottom-up A* search over possible phone strings, incorporating the acoustic score of each phone along with the score of the best Angie parse for the path up to that phone.

Figure 1: An Angie parse of "what is u" along with column scores. A phone preceded by a hyphen denotes a phone that was not actually realized in the acoustic data.
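The column scoring can be sketched schematically as follows; the probability table, unit names, and the simplified conditioning context (parent and left sibling only) are hypothetical stand-ins for Angie's trained models, which condition on richer context as described above:

```python
# Schematic sketch of scoring one parse-tree column as a product of
# conditional probabilities, accumulated in log space. The PROBS table
# and its entries are invented for illustration, not trained estimates.

import math

# Hypothetical trained probabilities P(unit | parent, left_sibling).
PROBS = {
    ("l",  "onset",   "f"):          0.30,
    ("ay", "nucleus", None):         0.25,
    ("onset", "stressed_syl", None): 0.60,
}

def column_logprob(column):
    """column: list of (unit, parent, left_sibling) triples, one per layer."""
    total = 0.0
    for unit, parent, left in column:
        p = PROBS.get((unit, parent, left), 1e-4)  # floor for unseen events
        total += math.log(p)
    return total

# Score a column containing phoneme "l" in the onset of a stressed syllable.
score = column_logprob([
    ("l", "onset", "f"),
    ("onset", "stressed_syl", None),
])
```

Summing such column scores down a path, together with per-phone acoustic scores, gives the kind of combined path score used to rank hypotheses in the bottom-up search.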

We face several key research challenges relating to:

  • Linguistic modelling, e.g. choosing context-free rules that provide a sufficiently detailed model of the subword structure to impose relevant constraints while still allowing significant sharing;
  • Acoustic modelling, e.g. arriving at a workable set of phones and phone models for our framework;
  • Search, e.g. controlling our bottom-up search.

The last of these is probably the most difficult to resolve. A bottom-up search permits the use of subword constraints, but it also postpones the imposition of very powerful word-based constraints (e.g. n-gram) until the end of a word is reached. Such a strategy may be computationally expensive, and efforts will have to be made to control the use of resources. Because we also want a one-pass search to permit, in principle, real-time operation, we will probably have to develop a suitable future-underestimate heuristic.
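The role of an admissible future underestimate can be illustrated with a generic A* sketch over a toy phone lattice. The lattice, step costs (standing in for negative log acoustic and parse scores), and heuristic values below are invented for illustration and are not drawn from the actual recognizer:

```python
# Generic A* sketch: each partial path is ranked by its accumulated cost
# plus an underestimate of the remaining cost. If the heuristic never
# overestimates, the first complete path popped off the queue is optimal.

import heapq

# Hypothetical lattice: node -> list of (next_node, step_cost).
LATTICE = {
    "start": [("f", 1.0)],
    "f":     [("l", 0.5)],
    "l":     [("ay", 0.7), ("iy", 1.5)],
    "ay":    [("t", 0.4), ("end", 1.2)],
    "iy":    [("end", 0.3)],
    "t":     [("end", 0.2)],
}

# Admissible future underestimates: each value is at most the true
# cheapest remaining cost from that node to "end".
HEURISTIC = {"start": 2.0, "f": 1.5, "l": 1.0, "ay": 0.6,
             "iy": 0.3, "t": 0.2, "end": 0.0}

def a_star(start="start", goal="end"):
    # Queue entries: (cost + underestimate, cost so far, node, path).
    frontier = [(HEURISTIC[start], 0.0, start, [start])]
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, cost
        for nxt, step in LATTICE.get(node, []):
            g = cost + step
            heapq.heappush(frontier, (g + HEURISTIC[nxt], g, nxt, path + [nxt]))
    return None, float("inf")

path, cost = a_star()
print(path, cost)  # best path through the toy lattice
```

The practical difficulty noted above is precisely in constructing such an underestimate for real phone strings: it must be tight enough to prune the search yet never overestimate, or optimality of the one-pass search is lost.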

Our Angie-based recognizer is still in its infancy. To date, we have built an operational forced-search system to permit the training of the required acoustic and lexical models and the exploration of issues relating to the challenges outlined above. We have settled on the ATIS domain as the initial domain to study because much ATIS data is available, along with performance benchmarks.


[1] Meng, H.M., Phonological Parsing for Bi-directional Letter-to-Sound / Sound-to-Letter Generation, Ph.D. thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, June 1995.