We’ve already seen normalization. We need to recognize that words like “Care”, “care”, and “cARe” (along with any other case variants) represent the same term, and the way to enforce that is to store the term in a standard, normalized form. Stemming and lemmatizing give us additional, more sophisticated ways to normalize tokens. Beyond being identical up to upper- and lowercase differences, the terms above are all forms of the present-tense verb “care”. Most search engines treat related tenses or forms such as “cared” and “caring” as variants of that same word. In the same way, singular and plural nouns (like “car” and “cars”) should be treated as variants of one word. We need a way to normalize our search terms that captures this more general idea, and stemming and lemmatization let us do that.
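As a reminder of the simpler, case-only normalization, here is a minimal sketch. The function name normalize_case is made up for illustration, and it assumes plain English terms where folding case is all that is needed.

```python
def normalize_case(term):
    # casefold() is a slightly more aggressive lower() intended for
    # case-insensitive matching; for ASCII text the two agree.
    return term.casefold()


variants = ["Care", "care", "cARe"]

# All case variants collapse to a single normalized form.
normalized = {normalize_case(term) for term in variants}
print(normalized)
```

Case folding alone leaves “cared” and “caring” as separate terms, which is the gap stemming and lemmatization are meant to close.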
Stemming a search term converts that term into a stem. Any terms that have the same stem are considered equivalent. Lemmatizing a search term converts that term into a lemma. Any terms that have the same lemma are considered equivalent. That’s easy, right?
There are several algorithms for stemming, and each is called a stemmer. There are several approaches to lemmatizing, and each is called a lemmatizer. The difference between stemmers and lemmatizers is one of philosophy. Lemmatizers are based on linguistic models and follow a process to convert a word into its root word, called a lemma. Stemmers are more mechanistic and follow a set of rules for truncating words (removing their endings) to convert them into a root, called a stem, that may no longer be a word.
The main conceptual difference between stems and lemmas is that a lemma is always a word while a stem may or may not be.
The most widely used stemmer is called the Porter Stemmer, and it has a webpage. It has been around for decades (Martin Porter published it in 1980), and its algorithm is described by a set of rules that are applied to the word in succession. Each rule, if its criterion is satisfied, removes or rewrites a suffix, usually shortening the word. After the rules have all been tried, the word has been reduced to its stem. A simple example is that “cats” becomes “cat” after stemming. In this case, the stemmer makes singular and plural nouns equivalent. Likewise, “ponies” becomes “poni”, which is not a word; “pony” also becomes “poni”, so the two forms still match. The stemmer is not perfect, however. An irregular plural such as “feet” is left unchanged and never matches “foot”, so “foot” documents would not be returned for a “feet” search. That is a false negative: a document that should be returned is not. The Porter Stemmer can also produce false positives, documents that are returned but should not be. For example, “organ” and “organization” both stem to “organ”, and returning “organ” documents for “organization” searches, or vice versa, is incorrect behavior.
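To make the idea of successively applied rules concrete, here is a toy suffix stripper. It is only a sketch and not the Porter algorithm: the first step borrows the plural rules from Porter's step 1a, while the second step, the has_vowel condition, and the names STEPS and toy_stem are simplifications invented for illustration.

```python
VOWELS = set("aeiou")


def has_vowel(stem):
    """Condition used by some rules: the remaining stem must contain a vowel."""
    return any(ch in VOWELS for ch in stem)


# Each step is an ordered list of (suffix, replacement, condition) rules.
# Within a step, only the first rule whose suffix matches is considered.
STEPS = [
    [  # step 1: plural endings
        ("sses", "ss", None),
        ("ies", "i", None),
        ("ss", "ss", None),
        ("s", "", None),
    ],
    [  # step 2: -ed / -ing endings (toy version)
        ("ed", "", has_vowel),
        ("ing", "", has_vowel),
    ],
]


def toy_stem(word):
    word = word.lower()
    for rules in STEPS:
        for suffix, replacement, condition in rules:
            if word.endswith(suffix):
                stem = word[: len(word) - len(suffix)]
                # Apply the rule only if its condition (if any) holds.
                if condition is None or condition(stem):
                    word = stem + replacement
                break  # this step is done; move on to the next step
    return word


for word in ["cats", "ponies", "caresses", "walking", "sing"]:
    print(word, "->", toy_stem(word))
```

The vowel condition is what stops “sing” from being cut down to “s”; the real algorithm uses more refined conditions and several more steps.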
The most widely used lemmatizer is the WordNet lemmatizer. Its process is not as straightforward to describe as the Porter Stemmer’s.
Converting a word into its stem or lemma is straightforward using NLTK. The stemmers NLTK provides all work the same way: each provides a .stem() method that takes a word as its argument and returns its stem.
    import nltk

    term = "programmer"
    stemmer = nltk.PorterStemmer()
    print(stemmer.stem(term))  # prints "programm"
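Since terms that share a stem are treated as equivalent, one way to see the effect is to group a small vocabulary by stem. The sketch below assumes NLTK is installed; the helper name group_by_stem and the word list are illustrative and not part of NLTK.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer


def group_by_stem(terms, stemmer):
    """Map each stem to the set of terms that reduce to it."""
    groups = defaultdict(set)
    for term in terms:
        groups[stemmer.stem(term)].add(term)
    return dict(groups)


stemmer = PorterStemmer()  # constructed once, reused for every term
vocabulary = ["cat", "cats", "organ", "organization", "programmer"]
groups = group_by_stem(vocabulary, stemmer)

# Each printed line is one equivalence class of search terms.
for stem, members in sorted(groups.items()):
    print(stem, "<-", sorted(members))
```

Because the helper only calls stem(), any other NLTK stemmer (for example SnowballStemmer("english")) could be passed in instead, though the groupings it produces may differ.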
The lemmatizers are similar, providing a lemmatize() method, but the method takes an additional, optional parameter that specifies the word’s part of speech (for example 'n' for noun, which is the default, or 'v' for verb). This dependence on grammar illustrates how lemmatizers have a deeper linguistic connection than stemmers do.
    import nltk

    # The WordNet data must be installed first: nltk.download('wordnet')
    term = "feet"
    lemmatizer = nltk.WordNetLemmatizer()
    lemma = lemmatizer.lemmatize(term)       # lemma is 'foot'

    term = "processed"
    lemma = lemmatizer.lemmatize(term)       # lemma is still 'processed'
    lemma = lemmatizer.lemmatize(term, 'v')  # lemma is now 'process'
This “feature” of lemmatization (a word comes back unchanged when it isn’t recognized as the given part of speech) gives us a recipe for lemmatizing a word whether or not we know its part of speech:
    import nltk

    lemmatizer = nltk.WordNetLemmatizer()

    def to_lemma(term):
        global lemmatizer
        lemma = lemmatizer.lemmatize(term)
        if lemma == term:
            lemma = lemmatizer.lemmatize(term, 'v')  # double-check
        return lemma
Construct Once – Use Repeatedly
It’s important to note that we construct the lemmatizer only once yet may use it repeatedly to convert terms to lemmas. We would take the same approach with a stemmer. These objects can be expensive to construct because of the data they initialize, so it doesn’t make sense to construct them repeatedly if we plan to use them repeatedly. Construct them once somewhere in your code before calling them elsewhere. In the lemmatization example above, the lemmatizer is constructed as a global variable and used within the function. We saw earlier that this is not such a good approach, but it’s better than repeated construction. In the next topic, classes, we’ll see a better way to do this: making the stemmer or lemmatizer part of a class, so that an object (an instance of the class) has access to it.
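The difference between the two habits can be sketched with a stemmer, assuming NLTK is installed. The function names stem_all_rebuilding and stem_all are hypothetical; the second one avoids the global by taking the already-constructed stemmer as an argument.

```python
import nltk


def stem_all_rebuilding(terms):
    # Pattern to avoid: a brand-new stemmer is constructed for every term.
    return [nltk.PorterStemmer().stem(term) for term in terms]


def stem_all(terms, stemmer):
    # Preferred: the caller constructs the stemmer once and passes it in.
    return [stemmer.stem(term) for term in terms]


stemmer = nltk.PorterStemmer()  # constructed once
terms = ["cats", "ponies", "organization"]

# Both versions produce the same stems; only the construction cost differs.
print(stem_all(terms, stemmer))
print(stem_all(terms, stemmer) == stem_all_rebuilding(terms))
```

The results are identical either way, so the only thing the first version adds is the repeated construction cost.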