IR & Speech Recognition

This week's articles emphasized the tremendous difficulties facing the fields of spoken document retrieval (SDR) and automatic speech recognition (ASR).  They all also emphasize that the benefits that could be realized from further development in these fields would be well worth the difficulty of getting there (to put it conservatively).  Tempering this pessimism somewhat is the opinion expressed in James Allan's article.  He points out that even when there are significant speech recognition errors, such a retrieval method is still highly effective: "even a 50% word error rate means that at least half of the words are correct" (2).  In the context of document searches, his argument seems to hold water.  However, the primary reason it does is that he is querying documents with a substantial amount of text, which means the search terms appear repeatedly and in slightly different forms and tenses, and this redundancy helps absorb errors in speech recognition.  Additionally, a full document frequently provides context that a short query cannot, aiding in what he calls "word sense disambiguation" (2).
But the internet is not a collection of large text documents, and Allan acknowledges this.  He concedes that the shortcomings of ASR show up when searching short pieces of text.  While Allan doesn't mention Tweets or the titles of audio/visual material, they would obviously fall into this category (to be fair, Allan was writing before Tweets or multimedia were the pervasive internet outlets that they are today).  So while, in the context within which he was working, ASR issues may have been "solved problem[s]," that context does not resemble the internet as it exists today.  As is pointed out in "The TREC Spoken Document Retrieval Track: A Success Story," modern retrieval modalities do not have the redundancy of traditional models, and a single misrecognized word in a modality such as question answering can be "catastrophic."
The theme underlying these articles becomes clear in "Building an Information Retrieval Test Collection for Spontaneous Conversational Speech."  As with the previously mentioned articles, the authors seem to take as implicit in their study that spoken information retrieval is developed to the point that it can search traditional text documents, and even audio in the form of dictation or newscasts.  However, those contexts fail to account for all of the other factors that define how people actually talk.  Tone, inflection, accent: all of this still remains to be accurately searched, yet spontaneous conversational speech is by far the most important corpus of audio that exists.