How do you find which pages in a programming textbook are about for loops? There are two good choices:

  1. Read the entire book and write down which pages match
  2. Look up “for loop” in the index to get the list of pages

Either approach works, but approach 2 is faster (more time-efficient). Search engines take the same approach: they look up the search terms in an inverted index. The only catch is that they have to build that inverted index before they can use it. To build a book’s index, you have to read the book once in order to figure out which search terms appear on which pages. Likewise, for a search engine to build an inverted index, it has to read (scan) every document in the corpus one time. After building the index, it is ready to efficiently return search results.

A review of what we’ve seen so far

Here is a review of concepts and terminology so far:

  • Corpus – a collection of documents. The corpus is what is searched; search results are documents from the corpus
  • Tokenizing – converting a document into a list of tokens or terms
  • Normalizing – converting a token into a standardized form so that trivially different tokens can be treated as equivalent
  • Lexicon and vocabulary – two synonyms for the set of possible search terms
  • Stopwords – words excluded from the lexicon (like “a”, “and”, and “the” when Googling). They are omitted/ignored when searched for. A small code sketch of tokenizing, normalizing, and removing stopwords follows this list.
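To make those terms concrete, here is a minimal sketch of tokenizing, normalizing, and removing stopwords in Python. The regex tokenizer, lowercase-only normalization, and tiny stopword list are simplifying assumptions for illustration; a real search engine would use more sophisticated versions of each.

```python
import re

# A tiny stopword list for illustration; real stopword lists are much longer.
STOPWORDS = {"a", "an", "and", "the", "is", "of"}

def tokenize(document: str) -> list[str]:
    """Convert a document into a list of raw tokens (runs of letters/digits)."""
    return re.findall(r"[A-Za-z0-9]+", document)

def normalize(token: str) -> str:
    """Convert a token to a standardized form; here we simply lowercase it."""
    return token.lower()

def terms(document: str) -> list[str]:
    """Tokenize, normalize, and drop stopwords."""
    return [normalize(t) for t in tokenize(document)
            if normalize(t) not in STOPWORDS]

print(terms("This is the very model of a simple sample corpus."))
# ['this', 'very', 'model', 'simple', 'sample', 'corpus']
```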

How would we decide which document was the best match to the search term(s)?

If we don’t answer that question, we would return documents that contain the search term(s) in no particular order. This approach is called Boolean retrieval or Boolean IR. If we do answer the question and return documents in order from best to worst match, we are ranking them. This approach is known as ranked IR or ranked retrieval. For now, we start with Boolean IR to learn the basics of IR and then move on to ranked retrieval.

We’ve seen that the corpus would be stored in a persistent store like a relational database or perhaps a NoSQL store, depending on the particular search engine and the scale of the corpus. Let’s assume it’s a relational database, since we haven’t dealt with NoSQL databases in the INFO major. In the relational database there would be some kind of table with one document per row, and the primary key of the table would be the document’s ID.

Below is a simple sample corpus (say that 26 times fast!) showing the idea. The document ID is on the left and the text of a single document is associated with each ID.

A simple sample corpus
This is the very model of a simple sample corpus. The ID (leftmost column) uniquely identifies each document’s contents (center column). Let’s call it the Document table and pretend it’s in a relational database.

For web search, we could use a URL as the ID, or we could use an autogenerated integer as an arbitrary ID. The purpose of the ID, or any primary key, is to uniquely identify a row in a table. Because any value of a primary key references at most one row, the primary key defines a mapping from ID to document. For efficiency of repeated searches, the DBMS would implement this mapping as a hash table where the ID is the key and the document itself (the row the ID points to) is the value. Note the language used to describe the mapping: it’s typical to call this a key-value mapping or a key-value pairing, and there is at most one value associated with every key. When the DBMS maintains such an efficient mapping from keys to rows, the database is using an index. We can also say that the Document table is indexed by the document ID, the primary key of the table. This is what is meant by an index in computation: a key-value pairing where any key can be immediately and efficiently associated with a value.
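Python’s dict makes this concrete: a dict is a hash table, so mapping keys to values is exactly what it does. The sketch below models a tiny document table as a dict with document IDs as keys; the IDs and text are made up for illustration.

```python
# A made-up document table: document ID -> document text.
# A Python dict is a hash table, so lookup by key is fast (average O(1)).
documents = {
    1: "the quick brown fox",
    2: "a loop repeats a block of code",
    3: "for loops and while loops",
}

# Looking up a document by its ID is immediate; no scanning required.
print(documents[2])  # a loop repeats a block of code
```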

The problem is that search engine users don’t search by ID. They don’t enter document IDs into the search bar; if they already knew the document ID, they wouldn’t be searching! Instead they search by keywords. If all our search engine maintains is the document table (i.e., the corpus), then finding documents containing a keyword would involve scanning (looking at) every document in the corpus. Scanning is a linear operation (a loop over all rows/documents), and since a corpus is expected to be large, it is slow. Moreover, if a second search is attempted, we have to re-scan the entire corpus and keep track of matching documents all over again. Search results need to be returned quickly; it’s too slow and too inefficient to scan the corpus each time a query happens. We can trade storage space for time, and time matters in IR because users won’t wait.
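Here is what that per-query scan looks like as code: a minimal sketch reusing the made-up documents dict and the terms() helper from the earlier sketches. Notice that every query pays the full cost of the loop again.

```python
def scan_for(term: str, documents: dict[int, str]) -> list[int]:
    """Find documents containing a term by scanning every document.

    This is linear in the number of documents, and the cost is paid
    again on every single query.
    """
    matches = []
    for doc_id, text in documents.items():
        if term in terms(text):  # terms() as sketched earlier
            matches.append(doc_id)
    return matches

print(scan_for("loops", documents))  # [3]
```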

Instead, we scan the corpus once to build an inverted index and from then on quickly look up search terms in the index.

The Inverted Index

Since it’s slow and inefficient to scan the corpus for each search, we read each document in the corpus one time in order to build an inverted index. Once we’ve read the entire corpus, we can look up search terms in the inverted index to find out which document IDs contain those terms. Another simple sample corpus (left) and its accompanying inverted index (right) are shown below. The ID column in the Inverted Index table is not relevant for our discussion here!

A simple sample corpus (left) and its inverted index (right)

Such an index could be constructed by reading in a document, tokenizing it, and using a stopword list to remove unnecessary tokens. For every remaining token (every token that’s not a stopword), we make sure the index records that the token (the key in the index) appears in a certain document (identified by its document ID, the value in the index). An inverted index like the one above, where a search term is a key that refers to a collection of document IDs containing that term, is called a document index. Document indices are suitable for Boolean IR. We will eventually see a more complex kind of inverted index, called a frequency index, that is suitable for ranked IR.

A document index is a key-value mapping where the key is a search term and the value is a collection of IDs of documents that contain that search term.
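As a sketch, the construction process just described might look like the following in Python, again reusing the documents dict and terms() helper from earlier. A defaultdict of sets is one reasonable representation of the key-value mapping among several.

```python
from collections import defaultdict

def build_document_index(documents: dict[int, str]) -> dict[str, set[int]]:
    """Scan the corpus once, recording which documents contain each term."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in terms(text):     # tokenize, normalize, drop stopwords
            index[term].add(doc_id)  # term (key) -> document IDs (value)
    return index

index = build_document_index(documents)
print(index["loops"])  # {3} -- one dict lookup, no corpus scan
```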

Abstract Data Types (ADTs)

This material is hopefully a review, because ADTs have been explained to you in I210, I211, or some other course. An abstract data type (ADT) is a way of describing data structures that is independent of any particular programming language and independent of any implementation details. All that sentence means is that the description of an ADT is likely to be “just words” instead of code. In more detail, an ADT is a description of data and the operations allowed on that data. So ADTs describe not just information (the data), but what is allowed to be done with that data. For a more detailed description, see this page.

Example ADTs

Before we return to inverted indices, here are two examples you’ve likely seen in your previous classes.

A stack is an ADT. A stack is a collection of values. Note that it’s not specified whether that collection is an array, a list, a hash table, a million separate variables, etc. An abstract data type is abstract: the description sits above those kinds of details. The allowed operations on this collection of data are:

  • push(value) – add a new value to the collection of data
  • pop() – return a value from the collection of data. Stacks are LIFO (last-in first-out) so that the value returned is the one most recently added
  • is_empty() – indicates (true or false) whether the stack currently holds any values

The second example we’ll see is very simple once we’ve written all this for a stack. A queue is another ADT that is exactly like the stack except it is FIFO (first-in first-out): pop() returns the least recently added value.

In both examples, no implementation details are indicated. The data (collection of values) are described and the three allowed operations on that data are also indicated.
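To connect the abstract description to something concrete, here is one possible realization of the stack ADT in Python. It is only a sketch: a linked list or a fixed-size array would satisfy the same ADT equally well, and the class name and the choice of a Python list for storage are implementation details, not part of the ADT.

```python
class Stack:
    """One concrete realization of the stack ADT, backed by a Python list."""

    def __init__(self):
        self._values = []             # the collection of values

    def push(self, value):
        self._values.append(value)    # add a new value to the collection

    def pop(self):
        return self._values.pop()     # LIFO: return the most recently added value

    def is_empty(self):
        return len(self._values) == 0

s = Stack()
s.push(1)
s.push(2)
print(s.pop())       # 2 (last in, first out)
print(s.is_empty())  # False
```

A queue implementation would differ only in pop(), which would return the least recently added value instead (for example, by removing from the opposite end of a collections.deque).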

The Document Index ADT

A document index is also an abstract data type. The information stored is a key-value mapping where the key is a token or search term and the value is a collection of IDs of documents containing that search term. The allowed operations on that data are listed below (a sketch in code follows the list):

  • add a new entry (new token or search term) to the index
  • add a new document ID to a particular token’s entry in the index
  • get the list of document IDs containing a particular token (i.e. look up a search term in the index)
  • determine whether a particular document contains a particular search term
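Here is a minimal Python sketch of the document index ADT. The class and method names are ours, chosen to mirror the four operations above; backing the mapping with a dict of sets is one implementation choice among many.

```python
class DocumentIndex:
    """A sketch of the document index ADT: search term -> set of document IDs."""

    def __init__(self):
        self._entries = {}

    def add_term(self, term):
        """Add a new entry (token/search term) to the index."""
        self._entries.setdefault(term, set())

    def add_document(self, term, doc_id):
        """Add a document ID to a particular term's entry."""
        self._entries.setdefault(term, set()).add(doc_id)

    def lookup(self, term):
        """Get the IDs of documents containing a term (look up a search term)."""
        return self._entries.get(term, set())

    def contains(self, doc_id, term):
        """Determine whether a particular document contains a particular term."""
        return doc_id in self.lookup(term)

idx = DocumentIndex()
idx.add_document("loops", 3)
print(idx.lookup("loops"))       # {3}
print(idx.contains(3, "loops"))  # True
```

For a multi-term Boolean query, the sets returned by lookup() can be intersected: idx.lookup("for") & idx.lookup("loops") yields the IDs of documents containing both terms.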

In I427, we consider the corpus to be static. Once we have amassed our corpus of web pages, we assume those web pages don’t change. In real life, web pages get edited, or they serve dynamic content (as a news site or a social network does) that is constantly, or at least regularly, updated. Some pages disappear from the web entirely. We have the luxury of pretending this doesn’t happen, so the list of operations above may be sufficient. Google does not have that luxury: its index is more dynamic, so additional operations for removing document IDs and removing entire entries may be necessary in that case.