The basic approach to scoring text documents for ranked information retrieval is to use the Vector Space Model (VSM). In the VSM, each document receives a score for how good a match it is to the query, and the highest scoring document is the best match, second-highest the second-best, and so on. It will take us a few classes to cover all of the VSM because it’s complicated, but we start here with the building blocks of the VSM scores–frequency measures calculated from the text in a document.
We’ve already seen how we tokenize and normalize a document. After those two processes, we have transformed a text document into a list of normalized tokens. To index the corpus, we do this for each document and update the index entry for each token we encounter. In Boolean IR, updating the inverted index is a two-step process. Since the keys are normalized tokens and the index maps each key to a collection of documents containing that notion, we first create an empty index entry if the token has never before seen and then we add the document’s ID to the collection of documents in that index entry. This key:value mapping in Boolean IR is the basic definition of the document index. In Ranked IR, we will need access to more than just the document IDs containing a search term, so we have to build a more complicated index called a frequency index. In addition to the documents containing a term, the frequency index can provide us with these frequency measures:
Frequency Measure | Notation | Definition |
---|---|---|
Collection Frequency | cf(t) or cft | The number of times, including duplicates, a term t appears in the corpus. |
Document Frequency | df(t) or dft | The number of documents in the corpus containing the term t |
Term Frequency | tf(t,d) or tft,d | The number of times, including duplicates, a term t appears in the specific document d in the corpus. |
Note that each of these three measures is defined as a count of some quantity. Each is a nonnegative integer (>= 0), and each is defined for every possible term t . The set of all possible search terms is our vocabulary, so the possible values of t are the vocabulary itself. If there are 100,000 possible search terms, then there are 100,000 distinct collection frequencies for this corpus. That’s not to say that each value is distinct; it’s to say that each collection frequency measures how many times one particular term appears in the corpus. If we read the corpus and find that ‘cat’ appears 3 times and so does ‘rabbit’, then it’s just a coincidence that both cf(‘cat’) and cf(‘rabbit’) both equal 3.
Some Examples
Consider this very model of a simple sample corpus below:
D1 | “Jack and Jill went up the hill.” |
D2 | “Three blind mice, three blind mice” |
D3 | “Who’s afraid of the big bad wolf?” |
D4 | “Goldilocks and the three bears.” |
With stopwords “and”, “of”, “the”, and “up”, the vocabulary with simple lowercase normalization is “jack”, “jill”, “went”, “hill”, “three”, “blind”, “mice”, “who’s”, “afraid”, “big”, “bad”, “wolf”, “goldilocks”, “bears”. For any of these terms, we could ask what its collection frequency or document frequency is. For instance, the document frequency of “three” is df(“three”) = 2 because the term “three” appears in two different documents. Its collection frequency is cf(“three”) = 3 because it appears 3 different times in the corpus–twice in one document and one time in another. cf(“cat”) = 0 in this corpus because “cat” does not appear at all; it’s not in the vocabulary. Similarly, the cf or df score for any stopword is 0 because we are ignoring stopwords here.
Likewise, for any term and any of the four documents, we can find the term frequency of a term in a particular document. tf(“three”, D2) = 2 because “three” appears twice in that document. tf(“three”, D4) = 1 because “three” only shows up once in that document, and tf(“three”, D3) = 0 because “three” does not appear at all in
Important Search Terms
Consider the table below showing the collection and document frequencies for two different terms.
term | cft | dft |
insurance | 10440 | 3997 |
try | 10422 | 8760 |
The cf values for each term are essentially identical. Both terms appear about the same number of times in the corpus. So which would lead to better search results? The answer is not to worry so much about the collection frequency but instead to look at the document frequencies. The term “insurance” appears in about half as many documents (4000 vs 8000) as the term “try”. So “insurance” is a term that has relatively few matching documents which makes it a better search term here. It’s better to appear in a few documents than many documents.
It’s better to appear in a few documents than many documents.
If we take this idea to the extreme and compare a term that appears in only 1 document to one that appears in every document in the corpus, we can see why the smaller df is better–it leads to more specific results, and its better to give the user specific results instead of many, many documents like Boolean IR does.
A small document frequency is better than a big one!
In Ranked IR, the custom is for high scores to mean good results and low scores mean bad results, but here a low df indicates a good result. We need to do something to transform a low df score into a high one…
The last observation to make here is that since “insurance” appears in so many fewer documents but has roughly the same collection frequency as “try”, then in most documents insurance appears in, it must have a higher term frequency! Not only is it a good search term because it leads to few documents, but those few documents are “more about” insurance than the “try” documents are about try! All things being equal, a high tf is better than a low tf. High-tf documents are more likely to really be about that term than low tf ones are.
High tf scores are better than low ones. High tf documents are “more about” their term!
Calculated Frequency Measures
In addition to the three above based on integer counts, there are two frequency measures that are calculated from those. They are shown below.
Frequency Measure | Notation |
---|---|
Inverse Document Frequency | idf(t) or idft |
Term Frequency-Inverse Document Frequency | tfidfdf(t, d) or tfidft,d |
We saw above that a low df indicates a good search term, but we would like to transform it into a high score. The inverse document frequency, idf(t), does just that. The name inverse suggests that idf(t) should be 1/df(t), but a different transformation is used. Before we get to that, though, note that 1/df(t) would transform high scores into low ones. Possible non-zero df values are 1, 2, 3, … which would be transformed into 1, 1/2, 1/3, … The increasing series of values gets transformed into a decreasing series. The problem is that the decreasing series decreases too quickly. The difference between successive df values 1, 2, 3, … is always 1. There’s not much difference gained by adding or taking away a term from a document 1 time. When those numbers are inverted, though, the difference is much more dramatic. Going from 1 to 2 cuts the inverse from 1 to 1/2, so that half the range of values in 0…1 are “not in play” for inverse scores.
It’s better to have a less steep decrease in scores, so a better transformation is used. “Better” here doesn’t mean there’s any deep mathematical or scientific reason for it; it means people have built search engines or similar IR systems and compared the results using different definitions of idf, and the definition we’re about to see works better in practice than the simplest definition we could imagine described above.
The Idf score’s definition
For a corpus with Nd documents, the idf score is defined by
idf(t) = log( Nd/df(t) ) if df(t) > 0, 0 otherwise
Instead of 1/df, Nd/df is used. Our sequence of values for df’s of 1, 2, 3, …, Nd gets transformed by that part of the equation into Nd , …, 1, or in other words the smallest possible value is 1 when a term appears in every document and the largest value is Nd when a term appears in only 1 document.
Everything written above about the large gaps is still true, though, and the logarithm function is used to “compress” those gaps. The example below shows how the df score can vary over six orders of magnitude while the log only varies from 0 to 6. A vast range is compressed to a much smaller range!
term | dft | idft |
calpurnia | 1 | 6 |
animal | 100 | 4 |
sunday | 1,000 | 3 |
fly | 10,000 | 2 |
under | 100,000 | 1 |
the | 1,000,000 | 0 |
One last note is that it doesn’t really matter which base logarithm is used. The example above showed base-10 to make the numbers easier to work out, but any base logarithm will compress the range, just by slightly different amounts. In practice log2 and log10 are used mostly for idf scores.
The tfidf score’s definition
We’ve seen a few things now:
- High tf scores are good
- Low df scores are good
- High idf scores transform good low df scores into good high numbers
There’s only 2 distinct concepts in this list of three things, for 2 and 3 are equivalent. If we want our ranked IR engine to calculate high scores for good search results, we need to use tf scores and idf scores. In fact, we can combine them into one score, the tfidf score, that provides us with a single measure of how good a search result we have for a single term.
The simplest way to combine them would be to multiply the tf score times the idf score, for we would have a high tfidf score whenever one or both of the tf and idf scores are large. In practice, though, a variation is used
tf-idft,d = (1 + log tft,d) log (N/ dft)
This term is basically just a slightly modified version of the tf times the idf we’ve already seen. The tf score in practice is found to have the same issues as the simple 1/df–the range is too wide. A document with a tf score of 2 is not twice as good a results as a document with a tf score of 1, so the log is use to compress that range. The smallest possible nonzero tf score is 1 and log 1 == 0, so if we did not have the 1 + term in the equation, we could get 0 scores no matter what the idf score is. This defeats our goal of having a high score whenever at least one of the tf or idf is large, but adding 1 amounts to shifting the range from >= 0 to >= 1. Mission accomplished!
There are other definitions of tfidf used that are similar to this one, so we have basically adopted the most common one for the presentation here.
Assume we have 10 million documents in the corpus and the word cat appears in one thousand of these. Consider a particular document containing 100 words wherein the word cat appears 3 times.
The term frequency, tf, for cat in this document is then 3.
Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4.
Thus, the tfidf score is, from the formula (1 + log 3)* 4 = (1 + 0.48) * 4 = 5.9
Ranking with one search term
If we have just one search term, here is how we rank the results:
- Look up the term in the frequency index and get the collection of documents that contain that term. This is the result set.
- For each document in the result set calculate the tfidf score for the term in that document.
- Sort the documents in the results by tfidf score in descending order. The highest score will be the first document, second-highest the second, and so on.
When we have more than one search term, it’s not obvious how to score them all. Should we add the different tfidf scores together or do something else? The something else we’ll end up doing is described in the next topic by the vector space model!