Python Collections – Kimmer's Course Notes

Python has a variety of built-in data structures that we will use in this course—lists, tuples, sets, and dictionaries. All of these are similar in that they represent collections of objects, but they each have their own strengths and weaknesses relative to each other for different tasks. They are also all objects, and so they have their own member functions that we may use to access or manipulate the data they contain. All four of the data structures mentioned above are heterogeneous which means they can hold a collection of different data types. This in contrast to a Java array, for instance, which must declare the type it can hold. For instance a Java array of integers can only hold integer values while a Python list could hold an integer, a floating point number, and a string or any other object we would care to place in the collection.

A good reference on all the information for each collection below is available in the official Python documentation here.

Lists

A list in Python can be created in multiple ways. Two ways allow us to create an empty list to which we can add items later. A third way allows us to create a list that is initialized to hold prescribed values. This code snippet illustrates all three ways:

empty_list = list() # we call the list() constructor here
another_empty_list = [] # this is a more typical way to create an empty list 
list_with_values = [1, 2.3, "informatics"] # list is heterogenous. Comma-separated list of values to initialize

The initialization of another empty list above is said to be more pythonic since it’s the typical way to program such a statement in Python. The previous line initializing empty_list would be less typical in a program, but it does illustrate to us that list is a class while the objects are instances of that class. The list class is part of Python, so it is a built-in class. Since it’s a class, it has member functions that we can use. One of the most common is list.append() which lets us add to the end of the list:

another_empty_list = [] # this is a more typical way to create an empty list 
list_with_values = [1, 2.3, "informatics"] # list is heterogenous. Comma-separated list of values to initialize

another_empty_list.append(1) # another_empty_list now equals [1]
list_with_values.append("i427") # now holds [1,2.3,"informatics","i427"]

There is also a list.extend() method which lets us append multiple values from one list onto another

another_empty_list = [] # this is a more typical way to create an empty list 
list_with_values = [1, 2.3, "informatics"] # list is heterogenous. Comma-separated list of values to initialize
another_empty_list.append(1) # another_empty_list now equals [1] list_with_values.append("i427") # now holds [1,2.3,"informatics","i427"] 
another_empty_list.extend(list_with_values) # now holds [1,1,2.3,"informatics","i427"] 

another_empty_list + list_with_values # + operator acts just like extend, now holds [1,1,2.3,"informatics","i427"]

Finally, the code below illustrates a key property of a list—random access by index. Python lists act like Java arrays in that each item has a position in the sequence. The first item is at index 0, the second at index 1, and so on.

>>> list_with_values = [1, 2.3, "informatics"] # list is heterogenous. Comma-separated 􏰋→ list of values to initialize
>>> print(list_with_values[0])
1
>>> print(list_with_values[1])
2.3
>>> print(list_with_values[2])
informatics
>>> print(list_with_values[-1])
informatics
>>> print(list_with_values[-2])
2.3
>>> print(len(list_with_values))
3

The previous example illustrates that we can access using negative indices: -1 is the last item, -2 the next to last, and so on. Even though list is a class, we determine the length of a list using the built-in len() function and supplying the list as an argument. We do not use a member function in the list class to determine the length. Access by index is one of the key traits of a list, so whenever we are holding a collection of data and we care that an item is first, second, third … we would choose list as our container.

Traversing a list

It is common in computing to need to do something to every item in a list. To accomplish these repeated (once per element) operations, we typically use iteration–a for loop– in programming. Iterating over these collections in Python is similar to using enhanced for loops in Java. A dummy variable acts as a placeholder for each value in the collection in turn. We are free to choose its name. The dummy variable is not an index; it is a value in the collection. Short variable names like i are appropriate for an index but not a value. item or value are appropriate generic names, or a variable name that describes the contents o the collection would be appropriate. The only times single letter variable names would be appropriate would be when implementing mathematical operations or formula, like adding up elements, on a list. Then names like i,j, or k would be appropriate for int values or x, y, or z may be appropriate for float.

>>> list_of_names = ["Larry", "Moe", "Curly"]
>>> for item in list_of_names: # generic dummy variables ... print(item)
...
Larry
Moe
Curly
>>> for name in list_of_names: # more meanginful dummy name ... print(name)
...
Larry
Moe
Curly

Python provides the enumerate() method for cases when we need the value in the list as well as the index of that value. In this case, we get to name two dummy variables. The first is the index, and i is an appropriate dummy variable name in this case. The second is the value in the list, and we should name it as described above.

>>> list_of_names = ["Larry","Moe","Curly"]
>>> for i,name in enumerate(list_of_names):
...     print(i)
...     print(name)
...
0
Larry
1
Moe
2
Curly

What is happening in the previous example is that the enumerate method is yielding (i.e. returning each iteration) a tuple of 2 values which are simultaneously assigned to i, name for each iteration of the loop. Also note that since i is an index, it’s an appropriate variable name in this case. We don’t know what a tuple is yet, so let’s find out.

Sets and Tuples

Aset and a tuple are similar to a list in many ways, but each also has at least one key difference. A set represents a mathematical set, so it cannot hold duplicate values and has no particular ordering. This is in contrast to a list which can hold duplicate values and is ordered.

>>> empty_set = set()
>>> duplicate_list = [1,1,2]
>>> set_with_values = {1,1,2}
>>> print(duplicate_list)
[1, 1, 2]
>>> print(set_with_values)
{1, 2}
>>> set_with_values[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'set' object does not support indexing

In the above example we see that duplicates cannot exist in a set and that we also cannot access items by index in a set . We use a set when we specifically need the property of no duplicate values. We can compute all the normal set operations:

>>> one_set = {1,3,4,5}
>>> another_set = {4,5,8}
>>> one_set | another_set # this is union
{1, 3, 4, 5, 8}
>>> one_set.union(another_set) # another way
{1, 3, 4, 5, 8}
>>> one_set & another_set # this is intersection {4, 5}
>>> one_set.intersection(another_set) # another way {4, 5}
>>> one_set - another_set # set difference
{1, 3}
>>> one_set.difference(another_set) # another way {1, 3}

A tuple like a list supports access by index and allows duplicate values. So it is very similar to a list. However, a list’s value can be changed (it is mutable ) while a tuple is immutable.

>>> empty_tuple = ()
>>> empty_tuple.append(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'tuple' object has no attribute 'append'
>>> tuple_of_values = (1,2.3,"informatics")
>>> list_of_values = [1,2.3,"informatics"]
>>> print(tuple_of_values[1])
2.3
>>> list_of_values[-1] = "i427" # we can change items in a list 
>>> print(list_of_values)
[1, 2.3, 'i427']
>>> tuple_of_values[-1] = "i427"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment

We can see that we can use [] to access a value in a tuple by its index, but we cannot change that value. Likewise, we cannot add values to a tuple via append (or extend for that matter, even though it’s not shown). If we need the property of immutability, then we use a tuple. In I427, we will see that rows from a database are returned to us as tuples. And then we can recognize the word tuple from I308 and connect the concept of collection of values to row in a table in a database.

Traversing sets and tuples

Sets and tuples can be traversed exactly like lists. A dummy variable in a for loop can take on each value in the collection in turn. Tuples have an ordering, so a tuple would be reversed in order of index. Sets do not have an order, to the items in a set occur in no particular order. If we need to traverse items in a certain order, we would need to sort the set’s values, for instance.

>>> tuple_of_days = ("Friday","Saturday","Sunday")
>>> for day in tuple_of_days:
...     print(day)
...
Friday
Saturday
Sunday

Here is a similar example using a set.

>>> set_of_days = {"Friday","Saturday","Sunday"}
>>> for day in set_of_days:
...     print(day)
...
Sunday
Friday
Saturday

Once again, one thing worth noting is that the ordering of sets is not preserved. This is not a guarantee that the set class provides the programmer.

Dictionaries

Dictionaries are a built-in data type in Python that provide some functionality of a hash table (the Java HashMap class) but are more properly called an associative array. One possible conceptual model of a dictionary is that it is a list that is indexed by an arbitrary key rather than an integer number from the sequence 0, 1, … length of data structure minus one. So the basic access pattern for a dictionary is dict object_name[key] which looks just like accessing a list item. The indication that dict object is a dictionary and not a list comes when the object is instantiated; a list object is initialized with [ ] while for a dictionary you use { }, i.e. dict object_name = { }. With a list, we insert items at a desired index. For instance append() builds up a list item by item with the next index depending on the number of items in the list so far. Other methods we haven’t covered let you insert items at other places in a list, but there is a performance penalty to do this. (Think back to linked lists in I210/I211 where one must traverse a list to find a given position.) Dictionaries do not incur this performance penalty, so they provide fast lookup and insertion of arbitrary keys. So with a dictionary, we can add items in arbitrary order, and adding each item will be similarly efficient.

>>> list_object = [1, 2, 3]
>>> dict_object = {} # empty dictionary
>>> dict_object[50] = 1 # assign the value 1 to the key 50
>>> dict_object[0] = 2 # assign the value 2 to the key 0
>>> dict_object["i427"] = "Search informatics" # dictionaries keys don't have to be an integr >>> list_object[0]
1
>>> dict_object[0]
2
>>> list_object[1]
2
>>> dict_object[1]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 1
>>> dict_object[50]
1
>>> dict_object[50] = 2 # overwrite the value for key 50

Note how easy adding an object to the dictionary is by simple assignment, and note the error message we receive if we try to access a non-existent key. If we evaluate dict object we see that it’s presented as a comma-separated list of items with each item being of the form key:value. There is nothing special about the ordering of the key:value pairs, and there is no guarantee they will be presented to the programmer in any particular order. All the collection types we have seen (list, tuple, set, and dict) hold collections of items, but in all of them except dict an item is a single value. In a dict the item is a pairing of two things—the key and the value.

>>> dict_object
{0: 2, "i427": "Search informatics", 50: 2}

At this point, it’s important to remember that the “indices” (really the keys) don’t have to be integers; they can be virtually any object (technically any object that’s hashable, but that’s beyond our scope here). So, as an example, let’s use a dictionary to hold some data in a more typical situation. Let’s say we have 5 character strings that act as primary keys for a table with 3 other columns of data – the first 2 of which are floating point while the last is a string, and also let’s store the column names in our dictionary:

>>> our_table = {}
>>> our_table['column names'] = ['age','salary','name']
>>> our_table['12345'] = [20.0,15000.,'Kimmer']
>>> our_table['45678'] = []
>>> our_table['45678'].append(25.5)
>>> our_table['45678'].append(50000.)
>>> our_table['45678'].append('Ramachandran')

This example shows that dictionaries can hold most anything. The values can be most any old object including lists. The second value is in fact begun as an empty list, and then we keep adding to it. Every time after the initial invocation of our table[’45678’] that initializes that value, invoking it returns out list to us, and with that list object, we’re free to use any of the list methods we need including appending to add items to the value. Values are not immutable; they are mutable and can be changed.

In the example above, the table of data was stored with one row per dictionary entry indexed by the key of the table. This is almost always a good way to store tabular data. The clunky part of the implementation above is that the column names are stored in their own dictionary entry, and column names aren’t really an actual row in a table. They’re metadata instead of data in the table. So let’s fix that. If every row in the table has three values for each column, we could explicitly indicate that for each row.

>>> our_new_table = {}
>>> kimmer_row = {'age' : 20.0, 'salary' : 15000 ,'name' : 'Kimmer'}
>>> dr_r_row = {'age' : 25.5, 'salary' : 50000 ,'name' : 'Ramachandran'}
>>> our_new_table['12345'] = kimmer_row
>>> our_new_table['45678'] = dr_r_row

So in this case we have made a dictionary of dictionarys instead of a dictionary of lists. Either is viable, but the advantage of the second approach is that we don’t have to remember that the ’age’ column is first, ’salary’ second, etc. We can access columns by name which makes our programs considerably easier to write and to read. The readability of a program doesn’t seem that important to many students at first, but once you realize large programs are written in pieces at many different times, the ability to come back to a partially completed program after a big gap of time and quickly get back up to speed with it is invaluable. So here’s are good and not-so-good approacesh to accessing data in a table compared:

>>> print(our_new_table['12345']['name'])
Kimmer
>>> print(our_table['45678'][2]) # we "just have to know" that the name is stored third. It's less effective.

In either case, students need to come to terms with the idea of multiple indexing (the need for 2 or more [][] when accessing items in the table). For tables, which are 2D data structures, you have to specify rows and columns. We store tables as collections of rows, so the row index is first, and then we access each row by column. So the indexing is [row][col], and we can read that from left to right. The first access is row and the second is col. In even more detail, when we access our new table by row via our new table[’12345’] we need to be thinking

our_new_table is a dict, so we access it by key
our_new_table[’12345’] returns the value associated with that key.
The value is another dict , so (our_new_table[’12345’]) can be used in the same way a plain old dictionary object can
(our_new_table[’12345’])[’name’] will then get the value associated with the ’name’ key in our dictionary containing the row
But the () are unnecessary and detract from readability, so we should instead write it as our_new_table[’12345’][’name'] The same reasoning would apply to our table[’45678’][2] except (our table[’45678’]) is a list and we must access it by index, so the next thing in [] must be a numerical list index.

Traversing a dictionary

If we use the same for-loop approach we’ve used so far with lists, tuples, and sets, we’ll end up with the keys of the dictionary stored in the dummy variable during each iteration. Note that the teams have changed for 2 out of the 3 players since this code was written. NBA churn, baby!

>>> dict_of_teams = { "LeBron" : "Lakers", "KD" : "Warriors", "CP3" : "Rockets" }
>>> for key in dict_of_teams:
...     print(key)
...
LeBron KD
CP3

Of course, if we have the keys, we can use them to access the values:

>>> for key in dict_of_teams:
...     print(key)
...     print(dict_of_teams[key])
...
CP3
Rockets
LeBron
Cavs
KD
Warriors

But such code is not very Pythonic. The root of the problem is that dictionaries are more complicated since they involve not one value per item in the collection but two–the key and the value. The first example is good. If we only need the keys, that is how we traverse a dictionary. If we need the key and the value, then, Python provides an items() function that returns both in dummy variables for every iteration.

>>> for key, value in dict_of_teams.items():
...     print(key)
...     print(value)
...
LeBron
Lakers
KD
Warriors
CP3
Rockets

Finally, it is worth mentioning that if we don’t need the keys, we shouldn’t bother to access them. Python also provides us with a values() function in this case, but it’s much less commonly-used than items(). Incidentally, there’s also a keys() function that is equivalent to the first example. It’s an antipattern, so it’s not illustrated here.

>>> dict_of_teams = { "LeBron" : "Lakers", "KD" : "Warriors", "CP3" : "Rockets" }
>>> for value in dict_of_teams.values():
...     print(value)
...
Lakers
Warriors
Rockets

To achieve this form of iteration, the .items() member function of the dict class yields a 2-item tuple just like enumerate() did with a list. On the other hand the member functions .keys() and .values() return either desired part of each dictionary entry. Referring to the first dictionary example, we can see that it is not necessary to invoke .keys() :

>>> dict_of_teams = { "LeBron" : "Lakers", "KD" : "Warriors", "CP3" : "Rockets" }
>>> for key in dict_of_teams:
...     print(key)
...
LeBron
KD
CP3
>>> for key in dict_of_teams.keys():
...     print(key)
...
LeBron 
KD
CP3