
In Re: Text Processing

Because words can occur in an effectively infinite number of orders and patterns, and because computers lack the ability to think critically and deduce the “meaning” of a word on its own or within its broader context, the method of Text Processing was born: a computer is programmed to index terms, or representations of text, rather than the text itself. This text-to-representations-of-text move is a subtle but significant distinction, allowing for effective searching that requires no “knowledge” about the terms in the corpus being searched.
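As a rough sketch of what indexing representations rather than raw text can look like, here is a minimal inverted index in Python. The function name and toy documents are hypothetical, chosen only for illustration, not taken from any particular search engine:

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the set of document ids containing it.

    Only the terms (representations of the text) are stored, so a
    search needs no "knowledge" of what the terms actually mean.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Hypothetical toy corpus, for illustration only.
docs = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog chased the cat",
}
index = build_inverted_index(docs)
print(sorted(index["cat"]))  # ['doc1', 'doc2']
print(sorted(index["dog"]))  # ['doc2']
```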
When determining which document, webpage, etc. is most relevant to a query, a computer will generally order the results based upon the number of occurrences of the word(s) it’s been instructed to search for.  This basic process of relying solely upon occurrence-tallying to determine relevance is problematic because a mere 50 words account for roughly 40% of all word usage in English, a language with a total vocabulary approaching one million words.  The problem was solved in part with the realization that about half the words in a given collection occur only once, and that these words could be equally, if not more, relevant to a query. Zipf’s Law captures this relationship: the less frequently a word occurs in a body of text, the more significant it is in determining relevant results for a query.
This marked a major addition to the method of simply counting word occurrences and ranking accordingly; taken to its logical extreme, the most important word in a body of text would be one that occurs only once.  Such words are known to linguists as “hapax legomena.”
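To make the occurrence-tallying idea concrete, here is a sketch in Python that counts occurrences, weights rarer terms more heavily, and flags hapax legomena. The simple inverse-frequency weight is an assumption chosen for illustration, not a formula taken from any particular system:

```python
from collections import Counter

def score_documents(query_terms, documents):
    """Rank documents by query-term occurrences, weighting rarer terms higher.

    The 1 / collection-frequency weight is a simple stand-in for the
    intuition that infrequent words carry more signal; real systems use
    more refined weighting schemes.
    """
    collection = Counter()          # term counts across the whole corpus
    per_doc = {}                    # term counts per document
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        per_doc[doc_id] = counts
        collection.update(counts)

    scores = {}
    for doc_id, counts in per_doc.items():
        score = 0.0
        for term in query_terms:
            if collection[term]:
                score += counts[term] * (1.0 / collection[term])
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def hapax_legomena(documents):
    """Return the words that occur exactly once in the entire corpus."""
    collection = Counter()
    for text in documents.values():
        collection.update(text.lower().split())
    return {term for term, count in collection.items() if count == 1}
```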
Similarly, the Language Modeling approach begins with the common-sense premise that a query is most likely to generate relevant results if the user enters words likely to be found in the document(s) they are seeking, and not in other documents.  Under the Unigram Language Model, the computer scores documents on those words alone, completely ignoring context.  As mentioned in conjunction with the Text Processing model above, this basic method is fairly effective for most text searches, since it generally doesn’t matter in such searches, for example, where the query word appears in a sentence.
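A minimal sketch of unigram scoring follows, assuming a tiny smoothing constant (an assumption made here purely for illustration) so that a single unseen query word doesn’t zero out the whole score:

```python
from collections import Counter

def unigram_score(query_terms, document_text, smoothing=1e-6):
    """Score a document as the product of P(term | document) for each query term.

    Word order and context are ignored entirely (the unigram assumption).
    The small smoothing constant, assumed here for illustration, keeps an
    unseen query word from collapsing the whole score to zero.
    """
    counts = Counter(document_text.lower().split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    score = 1.0
    for term in query_terms:
        score *= counts[term] / total + smoothing
    return score

# The document mentioning both query words scores higher.
print(unigram_score(["cat", "mat"], "the cat sat on the mat"))
print(unigram_score(["cat", "mat"], "the dog chased the cat"))
```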
Just as with the Text Processing model, the Language Modeling approach also operates under the assumption that the individual making the query does so with some knowledge of which words, if searched for, would be most likely to locate the documents most relevant to them.  This assumption has several problems.  First, the reason for searching for information about something is often precisely that you don’t know which words will be relevant.  Second, only a relatively small (and ever-decreasing) subset of the population understands Boolean searching, which would be the type of search best suited to either of the above models, as it allows someone to give the computer at least some ability to take context and word placement into account.
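For completeness, here is a sketch of the simplest Boolean operation, an AND over a term-to-documents mapping like the inverted index sketched above; this is illustrative only, and real Boolean syntax also offers OR, NOT, and proximity operators:

```python
def boolean_and(index, *terms):
    """Return the documents containing every term (a plain Boolean AND).

    `index` is a term-to-document-ids mapping like the inverted index
    built earlier; intersecting the sets keeps only documents in which
    all of the query terms appear.
    """
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()
```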
This post is 487 words long.