Boolean IR Query Engine – Kimmer's Course Notes

We’ve seen that there are two “flavors” of information retreival–Boolean IR and Ranked IR. One difference between them is that ranked IR scores each search result in order to sort them from best result (highest score) to worst (lowest score). In contrast, Boolean IR returns all results in no particular order. Since the ordering doesn’t matter, there’s no need to compute scores in Boolean IR, and this turns out to be a big simplification.

There are more subtle differences between the two kinds in how they handle multiple query terms. We know that for a single query term, we should look that term up in the index (or look up its stem or lemma instead). For more than one query term, we have to determine how to deal with the different potential search results. First we will see how Boolean IR handles multiple search terms.

Processing Multiple Query Terms

We will assume that the end user queries the search engine simply by specifying search terms. Everyone knows that Google or other engines with a more sophisticated query processor can answer queries like “What time is the Super Bowl tonight?”, but we won’t consider those kind. The query engine for such a sophisticated search engine may need to first determine what kind of query the user submitted and then return results accordingly. For simple queries like we will consider, we need to use the index to determine which documents can be search results. For more complex queries like the Super Bowl one, we would have to transform the query to determine what is being asked, find a document with the information in it, and extract that information. That’s more advanced!

Step 1. Normalize & Tokenize the query

So let’s assume we have a mult-term query like “The search Informatics course” and wish to know which documents should be returned in Boolean IR. The first step is to determine which terms are in the query. We’ve already seen how to do this. We tokenize the query. After tokenization, we determine the terms to be ["The","search","Informatics","course"]. We have tokens, but the case may lead us to treat this query differently than one like “the search informatics course”. We should normalize the tokens to treat trivially different ones as equivalent. We end up with ["the","search","informatics","course"].

Step 2. Remove stopwords

We’ve seen stopwords defined before but haven’t really used the idea yet. Search engines typically maintain a list of terms that are not considered vocabulary words. The simplest kind of such a list are the very common words like “a”, “and”, “the”, etc. that do not really give any extra meaning to a document. It’s also possible to add domain-specific words to a stopping list. If our search engine were for a corpus of informatics course materials from all INFO courses, we may consider “informatics” and “INFO” to be undesirable search terms because they would return too many results. We could simply remove them from consideration by adding them to out stopping list.

If we assume we’re only stopping the common articles “a”, “and”, “the”, etc. we will have a query of [“search”,”informatics”,”course”] after removing stopwords.

Step 3. Look up each term in the index

Once the query is tokenized & normalized with stopwords removed, we know which vocabulary terms are in play for this particular query. We can look up each term in the document index, and for each term, we will have a collection of document id’s for that term. To be specific, let’s assume the results for each term are { "search" : [2,4,6,8], "informatics" : [1,2,3,4,5,6], "course" : [4,5,6,8,9] }.

Step 4. Multiple terms correspond to set operations

It’s mainly in this step where Boolean and Ranked Retrieval start to differ. Remember we’re just looking at Boolean IR for now. The typical approach in Boolean IR is to treat queries with multiple terms using set operations. For a single-term query, we will have one collection of document IDs as our result set, so we just return those documents to the user in whatever output format we use for our search engine.

With more than one search term, we need to compute the result set from the various collections of documents we obtained from the index in step 3. Boolean IR search engines treat multiple search terms as having an implicit AND. If a user queries with multiple vocabulary words, they are expecting to receive only documents containing ALL the vocabulary words. An alternative would be an implicit OR where the user would be expecting documents containing ANY of the terms, but that’s not how Boolean IR works.

With implicit AND, the result set is the set of documents in common among all the collections of document IDs from Step 3. We had { "search" : [2,4,6,8], "informatics" : [1,2,3,4,5,6], "course" : [4,5,6,8,9] } , and each list of document IDs only have [4,6] in common. [4,6] is the result set, or those documents are the search results returned to the end user. With a simple query, we could determine the results by inspecting each term’s list and seeing which terms all lists had in common. More formally, with implicit AND we are computing the intersection of each term’s collection of document ID’s.

For our first web search engine, that’s where it stops. If we have multiple search terms, we tokenize and normalize the query, remove stopwords, look up each remaining term in the index, and compute the intersection to return the result set. Actual Boolean IR engines can do more sophisticated things. They can recognize keywords like AND, OR, and NOT and act accordingly to compute results in different ways. If no keywords are supplied, then AND is assumed; that’s where implicit AND comes from. So “search informatics course” is equivalent to the query “search AND informatics AND course”. If the user instead queries “search AND informatics NOT course”, then the engine would recognize the different and return only results that do not contain the term “course”. In that case, the intersection of the first two terms’ documents would be computed, but only those results that do not contain “course” would be returned. In Boolean set operations terms, the set difference between the “search AND informatics” and “course” results would be computed: “search AND informatics” – “course”.

So far we’ve seen that AND corresponds to intersection of results and NOT corresponds to set difference. In the same way, you should be able to convince yourself that OR would correspond to a union. Programming your search engine to recognize these keywords and act on them is an extra-credit part of the project assignment. We are only required to implement the intersection calculation. So let’s see how to do that in Python.

Python Implementation

We have seen how to tokenize and normalize a document, so that is a solved problem. Whatever methods we use to tokenize and normalize a document in our corpus should be used to process a query. Not only does this “write once” approach save us from having to reinvent the wheel, but it also ensures that a query term gets treated exactly the same way as a term in a document. It’s the only way we can guarantee that a word in a search term gets mapped to the same index entry that a word in a document will.

So it’s safe to assume we can easily convert a query into its normalized tokens and look each one up in a document index. Our starting point, then, is that we need to know how to compute the result set from multiple collections of document IDs. We need to know how to carry out set operations in Python. Fortunately, Python provides a set as a collection type. We will first complete our tour of the Python collections and then see an example of Boolean IR.

Last of the Python Collections: Sets & Tuples

Sets and tuples are similar to lists in many ways, but each also has at least one key difference. A set represents a mathematical set, so it cannot hold duplicate values and has no particular ordering. This is in contrast to a list which can hold duplicate values and is ordered.

>>> empty_set = set()
>>> duplicate_list = [1,1,2]
>>> set_with_values = {1,1,2}
>>> print(duplicate_list)
[1, 1, 2]
>>> print(set_with_values)
{1, 2}
>>> set_with_values[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'set' object does not support indexing

In the above example we see that duplicates cannot exist in a set and that we also cannot access items by index in a set . We use a set when we specifically need the property of no duplicate values. We can compute all the normal set operations:

>>> one_set = {1,3,4,5}
>>> another_set = {4,5,8}
>>> one_set | another_set # this is union
{1, 3, 4, 5, 8}
>>> one_set.union(another_set) # another way
{1, 3, 4, 5, 8}
>>> one_set & another_set # this is intersection {4, 5}
>>> one_set.intersection(another_set) # another way {4, 5}
>>> one_set - another_set # set difference
{1, 3}
>>> one_set.difference(another_set) # another way {1, 3}

A tuple like a list supports access by index and allows duplicate values. So it is very similar to a list. However, a list can be changed (it is mutable ) while a tuple is immutable.

>>> empty_tuple = ()
>>> empty_tuple.append(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'tuple' object has no attribute 'append'
>>> tuple_of_values = (1,2.3,"informatics")
>>> list_of_values = [1,2.3,"informatics"]
>>> print(tuple_of_values[1])
2.3
>>> list_of_values[-1] = "i427" # we can change items in a list >>> print(list_of_values)
[1, 2.3, 'i427']
>>> tuple_of_values[-1] = "i427"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment

We can see that we can use [] to access a value in a tuple by its index, but we cannot change that value. Likewise, we cannot add values to a tuple via append (or extend for that matter, even though it’s not shown). If we need the property of immutability, then we use a tuple. In I427, we will see that rows from a database are returned to us as tuples. And then we can recognize the word tuple from I308 and connect the concept of collection of values to row in a table in a database.

Traversing Sets & Tuples

Sets and tuples can be traversed exactly like lists. A dummy variable in a for loop can take on each value in the collection in turn.

>>> tuple_of_days = ("Friday","Saturday","Sunday")
>>> for day in tuple_of_days:
...     print(day)
...
Friday
Saturday
Sunday

Here is a similar example using a set.

>>> set_of_days = {"Friday","Saturday","Sunday"}
>>> for day in set_of_days:
...     print(day)
...
Sunday
Friday
Saturday

One thing worth noting is that the ordering of sets is not preserved. This is not a guarantee the set class provides the programmer.

Set Operations

Python provides built in operators and methods for calculating set intersections (&), unions (|), and differences (-). If we have two objects of type set, it’s as simple as set1 & set2, set1 | set2, or set1 - set2 for computing these. Likewise, an object of type set contains intersection(), union(), and difference() member functions we may use.

# look up each search term in the index
search_results = doc_index['search']
informatics_results = doc_index['informatics']
course_results = doc_index['course']

# compute the intersection for results two ways
result_set = search_results & informatics_results & course_results

result_set = search_results.intersection(informatics_results)
result_set = result_set.intersection(course_results)

Here’s a more real-world example that shows the steps “from the beginning” assuming we have a corpus of documents available.

doc_index = DocumentIndex(corpus)

query = "the search informatics course"
normalized_tokens = doc_index.normalized_tokens(query)

result_list = []
for search_term in normalized_tokens:
    result_list.append(doc_index[search_term])

# now compute the intersection of each set in result_list
# you do this part!