Excerpt from January '96 SLS Annual Research Summary

Sublexical Modelling for Speech Recognition with Angie

Raymond Lau

In the present research, we attempt to build a speech recognition system which explicitly models sublexical linguistic phenomena such as syllabic and morphological structure. We hope that such a system can provide a single unified probabilistic framework for modelling phonological variation and morphology. Many current systems handle phonological variations either by having a pronunciation graph (such as in MIT's SUMMIT system) or by implicitly absorbing the variations into a hidden Markov model. The former has the disadvantage of not sharing common subword structure, hence splitting training data. The latter masks the process of handling phonological variations and makes the process difficult to control and to improve upon. For example, in the ATIS domain, the words "fly," "flying," "flight," and "flights" all share the common initial phoneme sequence "f l ay", so presumably phonological variations affecting this sequence can be better learned if examples from all four words are pooled together. Our system does just that.

The sharing of subword structure will hopefully facilitate the search process and also make it easier to deal with new, out-of-vocabulary words. By pursuing merged common subword theories during search, we can mitigate the combinatorial explosion of the search tree, making large vocabulary recognition more manageable. Because we expect new words to share much common subword structure with words in our vocabulary, we can easily add new words dynamically, allowing them to adopt existing subword structures. In principle, we can even detect the occurrence of out-of-vocabulary words by recognizing as much of the subword structure as we can in a bottom-up manner.

To model constraints on the subword level, we are using a framework which we have named Angie. Angie is a layered system very similar to that proposed by Meng [1].
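The pooling of "fly," "flying," "flight," and "flights" described above can be pictured with a plain prefix tree over phoneme strings. This is only a rough illustration of shared subword structure, not Angie's layered parse: the `Node` class, the helper functions, and every baseform other than the shared "f l ay" onset are assumptions made up for the sketch.

```python
# Toy prefix tree over phoneme strings. A stand-in for the idea of shared
# subword structure; Angie itself uses a layered parse, not a trie.

# Hypothetical baseforms: only the shared "f l ay" onset comes from the text.
LEXICON = {
    "fly": ["f", "l", "ay"],
    "flying": ["f", "l", "ay", "ih", "ng"],
    "flight": ["f", "l", "ay", "t"],
    "flights": ["f", "l", "ay", "t", "s"],
}


class Node:
    def __init__(self):
        self.children = {}  # phoneme -> Node
        self.words = []     # words whose baseform ends at this node
        self.count = 0      # number of lexicon words passing through this node


def build_trie(lexicon):
    root = Node()
    for word, phonemes in lexicon.items():
        node = root
        for p in phonemes:
            node = node.children.setdefault(p, Node())
            node.count += 1
        node.words.append(word)
    return root


def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


def shared_prefix_length(root, phonemes):
    """How many leading phonemes of a new word already exist in the tree."""
    node, n = root, 0
    for p in phonemes:
        if p not in node.children:
            break
        node = node.children[p]
        n += 1
    return n


root = build_trie(LEXICON)
listed = sum(len(p) for p in LEXICON.values())
print(count_nodes(root), "shared nodes vs", listed, "phonemes listed word by word")

new_word = ["f", "l", "ay", "er"]  # hypothetical out-of-vocabulary word
print(shared_prefix_length(root, new_word), "of", len(new_word),
      "phonemes reuse existing structure")
```

In this toy lexicon all four words pass through the same "f l ay" nodes, so any statistics attached to those nodes would be estimated from the pooled examples, and a new word beginning with the same onset reuses them without retraining.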
In Angie, we have six layers: a top-level sentence layer, a word layer, a morphology and stress layer, a syllable layer, a phoneme layer and a phone layer. Adjacent layers are governed by context-free rules and probabilities. Angie assigns a probability to each internal node of a parse tree conditioned upon the node's child and left sibling, and to each bottom leaf node conditioned upon the node's left column, that is, its left sibling and that sibling's ancestors. An example of an Angie parse along with each column's score is shown in Figure 1. In our recognizer, we perform a bottom-up A* search over possible phone strings, incorporating the acoustic score of each phone along with the score of the best Angie parse for the path up to that phone.
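The two kinds of conditioning described above can be sketched as a column-by-column log score. This is a minimal sketch under stated assumptions, not the actual Angie implementation: the function names, the `FLOOR` back-off value, the shortened four-layer columns, the category labels, and all probabilities are hypothetical. It also treats a node as shared with the left column whenever the labels from the top down to it match, which is a simplification.

```python
import math

FLOOR = 1e-4  # assumed probability for events missing from the toy tables


def column_log_score(column, left, p_leaf, p_up):
    """Log score of one parse column.

    column, left: tuples of labels ordered from the top layer down to the phone;
    left is None for the first column.
    p_leaf: {(phone, left_column): prob}          leaf given its whole left column
    p_up:   {(node, child, left_sibling): prob}   internal node given child and sibling
    """
    left_key = left if left is not None else ("<start>",)
    score = math.log(p_leaf.get((column[-1], left_key), FLOOR))
    # Walk upward from the node just above the phone.
    for depth in range(len(column) - 2, -1, -1):
        if left is not None and column[:depth + 1] == left[:depth + 1]:
            break  # this node and everything above it belong to the left column
        sibling = left[depth] if left is not None else "<start>"
        key = (column[depth], column[depth + 1], sibling)
        score += math.log(p_up.get(key, FLOOR))
    return score


def parse_log_score(columns, p_leaf, p_up):
    """Sum of column scores, read left to right."""
    total, left = 0.0, None
    for column in columns:
        total += column_log_score(column, left, p_leaf, p_up)
        left = column
    return total


# Hypothetical four-layer columns (word, syllable part, phoneme, phone) for "f l ay".
COLUMNS = [
    ("WORD", "ONSET", "f", "f"),
    ("WORD", "ONSET", "l", "l"),
    ("WORD", "NUCLEUS", "ay", "ay"),
]
# Made-up probabilities; anything not listed falls back to FLOOR.
P_LEAF = {
    ("f", ("<start>",)): 0.2,
    ("l", COLUMNS[0]): 0.5,
    ("ay", COLUMNS[1]): 0.4,
}
P_UP = {
    ("ONSET", "f", "<start>"): 0.6,
    ("NUCLEUS", "ay", "ONSET"): 0.7,
}

print(parse_log_score(COLUMNS, P_LEAF, P_UP))
```

Scoring one column at a time suits a search that extends phone strings left to right: each newly hypothesized phone adds one column score on top of its acoustic score, and nodes shared with the previous column are not scored again.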
Figure 1: An Angie parse of "what is u" along with column scores. A phone preceded by a hyphen denotes a phone which did not really occur in the acoustic data.

We face several key research challenges relating to:
Our Angie-based recognizer is still in its infancy. To date, we have built an operational forced search system to permit the training of the required acoustic and lexical models and the exploration of issues relating to the challenges outlined above. We have settled on ATIS as the initial domain to study because a large amount of ATIS data is available, along with performance benchmarks.
Reference

[1] Meng, H.M., Phonological Parsing for Bi-directional Letter-to-Sound / Sound-to-Letter Generation, Ph.D. thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, June 1995.
Last Updated: December, 2004. Contact: mail@raylau.com