| Truncation and Wildcards
One of the first mistakes in query formulation is not using truncation sufficiently. Let's look at this question with regard to our subject, bird. The singular and plural forms of an object are easy to overlook, but ignoring them can unduly restrict the universe of documents in which you will be conducting your search. Using AltaVista again, here are the document counts for the singular and plural versions of bird:

By using only bird or only birds as our subject, we would eliminate half or so of the potential documents that we'd like to use as our search basis. We could use both bird and birds as query terms, but that takes up valuable keyword slots. The better way to handle this problem is through truncation.

Truncation (or stemming) is applying a wildcard character after the first few letters of a term (the "stem"). The asterisk (*) is the almost universally accepted truncation wildcard. Generally, the stem must have a minimum of three characters at the beginning of the word. Once a term is marked for truncation, any characters after the stem will be picked up by the search query. Some search engines do stemming and truncation for you if you pick the right option on the search form. Some engines don't support truncation at all; in that case, the asterisk will generally be ignored or you'll get a query format error.

Remember, ANY word that begins with the stem will be matched to your query term if the search engine supports truncation. Thus, if we stem bird*, our search will match on the words bird, birds, birding and birdbrain. Posing bird* to AltaVista we now get these document counts:

Note the document count is a bit lower than the total for the individual words bird, birds, birding and birdbrain. There are minor errors in how search engines retrieve word stems.
But they are of a smaller magnitude than ignoring singular and plural cases altogether in the query, and seem a minor price to pay for being able to eliminate another keyword (birds, in addition to bird) from the search.

As you first begin to use truncation, you need to be aware of unintended consequences. In the case of the stem bird* there are relatively few unwanted words (such as birdbrain) picked up in the search. But let's look at another of the objects, city, in our mystery bird sample problem. To stem and pick up the plural form of city, cities, we would need to specify cit*. But look at some of the words this stem specification would match:
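
As a rough illustration of the matching behavior described here, the following Python sketch treats a truncated term as a simple prefix test. The function name `truncation_matches`, the `min_stem` parameter, and the small word list are all hypothetical, chosen only for illustration; they are not taken from AltaVista or any other search engine, whose actual matching may differ.

```python
def truncation_matches(pattern, vocabulary, min_stem=3):
    """Return the words in `vocabulary` matched by a truncation pattern.

    A pattern ending in '*' matches every word that starts with the stem.
    A pattern without '*' matches only the identical word.
    """
    if not pattern.endswith("*"):
        return [word for word in vocabulary if word == pattern]

    stem = pattern[:-1]
    # Assumption: enforce the common three-character minimum for a stem.
    if len(stem) < min_stem:
        raise ValueError(f"stem '{stem}' is shorter than {min_stem} characters")

    return [word for word in vocabulary if word.startswith(stem)]


# Hypothetical vocabulary; a real index would contain far more words.
VOCABULARY = [
    "bird", "birds", "birding", "birdbrain",
    "city", "cities", "citizen", "citation", "citrus", "citadel", "cite",
]

bird_hits = truncation_matches("bird*", VOCABULARY)
cit_hits = truncation_matches("cit*", VOCABULARY)

# Words picked up by cit* that are neither the singular nor the plural of city.
wanted = {"city", "cities"}
unwanted = [word for word in cit_hits if word not in wanted]

print("bird* matches:", bird_hits)
print("cit* unwanted matches:", unwanted)
```

Even with this tiny word list, most of the words matched by cit* have nothing to do with cities, while bird* picks up only one stray term.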
The cit* stem clearly picks up far too many unwanted words. Stemming tends to work best when the stem is longer, when plurals are formed by adding '-s' (as opposed to '-ies' or other forms), and when the stem is not the root of many other common words. With just a little thought, however, truncation is easy and can pay useful dividends in properly scoping your query with a minimum of keywords. We highly recommend its use. |