We’ve seen that data mining tasks can roughly be broken down into supervised learning tasks or unsupervised learning ones. “Supervised” means we have some feature whose value we know for the training data we use to build our model, and we use those known values to build a model that can help us predict similar values for new data. This supervised learning approach is also often called predictive modeling because of its focus on prediction. More specifically, we call it classification when predicting a category (a.k.a. nominal feature or factor) and regression when predicting a number. We’ve seen this picture before that summarizes this main idea of I421.

Once we have a model, we can ask it to make predictions, and there are two reasons we might need a model to predict for us when data mining:
- To give us a prediction where we don’t actually know what the answer is. In this case, we have learned from the data. That’s really the ultimate goal of predictive modeling. We want to build a model that’s good enough that we can use it on data where we don’t know the answer and get a useful, accurate result.
- To evaluate how good the model is. Before we can feel confident using a model on unknown data, as the big, ultimate goal above describes, we need to use data to measure its quality. In fact, this model evaluation is a key part of the data mining process that we’ve seen this semester.
The two goals above are fundamentally different. When we want a prediction we can act on, we generally aren’t sure about the answer. It’s unknown. If it’s a model where we’re predicting the presence or absence of cancer, you can imagine it would be really important to be able to predict whether or not someone has cancer. And you’d really be concerned that your model can make an accurate prediction. To get to the point where we know the model is accurate, then, we have to evaluate the model to measure how good it is. To do that, we should use data where we know the answer so we can see how well the model’s predictions match the known correct answers for that data.
A common approach for model evaluation is to have a particular metric in mind that quantifies how good your model is. Then data mining boils down to building a bunch of models and using the one with the best metric for your problem at hand. This approach is probably what most of you will end up doing for your projects. You’ll find a dataset you’re interested in, get the data clean so you can build models with it, and then build a bunch of models and compare them. To do this kind of work, you have to decide in advance how you will compare them: which metric you want to use. We’ll see several different evaluation metrics this semester, but today we’ll focus on only one: accuracy.
A model’s accuracy is the fraction of instances it predicts correctly.
To be definite, let’s consider the contact lenses dataset. Weka provides this dataset in its data folder, so you’re welcome to open Weka and explore this simple dataset. It has the four categorical features age, spectacle prescription, astigmatic, and tear production rate. The class feature is contact lenses, and the three possible classes are hard, soft, or none. The full dataset in Weka has 24 instances, so it’s rather small. To make it even simpler, let’s look at the 10-instance version below:
| Age | Spectacle Prescription | Astigmatic | Tear Production Rate | Contact Lenses | 
| young | myope | no | reduced | none | 
| young | myope | no | normal | soft | 
| young | myope | yes | reduced | none | 
| young | myope | yes | normal | hard | 
| young | hypermetrope | no | reduced | none | 
| young | hypermetrope | no | normal | soft | 
| young | hypermetrope | yes | reduced | none | 
| young | hypermetrope | yes | normal | hard | 
| pre-presbyopic | myope | no | reduced | none | 
| pre-presbyopic | myope | no | normal | soft | 
If we have a model that can predict the value of the Contact Lenses class given Age, Spectacle Prescription, Astigmatic, and Tear Production Rate values for an instance, we have a classification model that does what it’s supposed to. Let’s imagine we have a simple model that always predicts “Contact Lenses = none” no matter the values of the 4 features. We can evaluate the accuracy using our small dataset above by comparing the predictions against the correct answers we know.
| Predicted Class | Actual Class | Right or Wrong | 
| none | none | Right | 
| none | soft | Wrong | 
| none | none | Right | 
| none | hard | Wrong | 
| none | none | Right | 
| none | soft | Wrong | 
| none | none | Right | 
| none | hard | Wrong | 
| none | none | Right | 
| none | soft | Wrong | 
In this case, our “always predict none” model got every other instance correct. More precisely, it got 5 out of 10 correct, and the accuracy is the fraction of instances correct. The accuracy is 5/10 = 0.5 = 50% for this model. If we decide during our data mining process that accuracy is the appropriate metric for this dataset, then we may use this model in production if it’s the most accurate one we ever build. On the other hand, if we find a model that’s more accurate, we’d likely prefer the other model instead.
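The comparison in the table above can be sketched in a few lines of Python. This is only an illustration: the names `always_none` and `accuracy` are made up for this example, and the list of actual classes is copied from the 10-instance table.

```python
# Actual classes of the 10 instances in the abbreviated contact lenses table.
actual = ["none", "soft", "none", "hard", "none",
          "soft", "none", "hard", "none", "soft"]


def always_none(instance):
    """Hypothetical model: ignores its input and always predicts 'none'."""
    return "none"


def accuracy(predicted, actual):
    """Fraction of predictions that match the known classes."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# The model ignores the features, so None stands in for each instance here.
predicted = [always_none(None) for _ in actual]
print(accuracy(predicted, actual))  # 0.5
```

Any other model could be dropped in place of `always_none`, and the same `accuracy` function would score it against the same known answers.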
That’s almost all there is to accuracy. Build a model and ask it to make predictions on data for which you know the answer. The fraction of predictions you get correct is the model’s accuracy. The only “gotcha” is that you don’t want to use the same data you built the model with to evaluate it. You should always use different data. We’ve seen that the data used to build the model is called training data, so this gotcha means “Don’t ever use the training data to evaluate a model”.
Never use the training data to evaluate the model!!!
The reason to avoid using training data for model evaluation is that the model was built using the training data. It has “seen” those instances before, along with their classes, the “right answers” for those instances. It may not always get those instances correct because data is complicated in general, but it is likely to get them correct more often than it does for instances it has never seen before. Said another way, the accuracy estimate you get from the training data is biased. It’s likely to be higher than the estimate you would get from other data. For this reason, it’s common to split the data into two different parts before you build a model. One part is the training data we already know about, while the other part is called testing data. It’s the subset of the data that is held back and used only to evaluate the model.
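As a rough sketch of that split, the snippet below shuffles the instances and holds some of them back for testing. The function name `split_train_test`, the 30% test fraction, and the fixed seed are all assumptions chosen for illustration, not settings prescribed by the course or by Weka.

```python
import random


def split_train_test(instances, test_fraction=0.3, seed=0):
    """Shuffle a copy of the instances and split it into (train, test)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    # Hold back at least one instance for testing.
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


instances = list(range(10))  # stand-ins for the 10 rows of the small table
train, test = split_train_test(instances)
print(len(train), len(test))  # 7 3
```

The model would be built from `train` only, and its accuracy would be measured on `test` only, so no instance is used for both purposes.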
Accuracy of random guessing
There are three classes in the contact lenses dataset. We could imagine building a model that just randomly guesses one of the three classes. In other words, for any instance, there would be a 1/3 probability the model predicts hard, 1/3 probability of soft, and 1/3 probability of none. For every instance, there’s thus a 1/3 probability the guess is correct, no matter what the actual class is. For the whole dataset used to evaluate such a model, we would therefore expect random guessing to be about 1/3 accurate. In general, when there are N classes, uniform random guessing gives you a model whose expected accuracy is 1/N. For binary classification problems (problems where there are only two categories), that works out to 50/50. Note that our “always predict none” model above outperforms random guessing. It was 50% accurate vs the 33% accuracy we expect from random guessing.
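A quick simulation can make the 1/N claim concrete. The sketch below guesses uniformly at random for each of the 10 instances, repeats that many times, and reports the overall fraction correct. The function name, the number of repeats, and the seed are arbitrary choices for this illustration.

```python
import random

classes = ["hard", "soft", "none"]
actual = ["none", "soft", "none", "hard", "none",
          "soft", "none", "hard", "none", "soft"]


def random_guess_accuracy(actual, classes, repeats=3000, seed=0):
    """Estimate the accuracy of uniform random guessing by simulation."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(repeats):
        for a in actual:
            # Each guess ignores the instance and picks any class equally often.
            if rng.choice(classes) == a:
                correct += 1
    return correct / (repeats * len(actual))


estimate = random_guess_accuracy(actual, classes)
print(estimate)  # should land close to 1/3
```

The estimate varies a little with the seed and the number of repeats, but it stays near 1/3 and well below the 0.5 that “always predict none” achieves on this data.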
Zero Rule – a baseline accuracy
As described above, we can always build a bunch of models and pick the most accurate. But what if all the models are no good, and we’re just picking the best of a bad bunch? Is there any way we can know that, too? There is! If we’re using accuracy as our metric, we should always figure out the baseline accuracy. If we build a model with better accuracy, then we know it’s ok. If our model is worse than the baseline accuracy, then we need to keep trying and looking for a good model. Zero Rule (a.k.a. 0R or ZeroR) gives us that baseline accuracy.
If you’re using accuracy as a metric, always figure out the baseline accuracy for your dataset. Zero Rule gives us that baseline accuracy.
The Zero Rule model boils down to “always predict the majority class”. The majority class is the one with the most instances in the dataset. In the abbreviated contact lenses dataset above, we saw that none is the majority class. It had 5 instances while the other two classes only had 3 and 2 instances. That “always predict none” model we used above is the Zero Rule model for that dataset! So we can come up with the 0R model just by looking at the data! Figure out the majority class and always guess that. If you use such a model to predict, that will give you a baseline accuracy. The instances in the majority class are the ones 0R gets correct, so the fraction of instances in the majority class is your baseline accuracy. For the small contact lenses dataset, always predicting none gave us 50% accuracy. That’s our baseline accuracy here. If we build a classification model that’s better than 50% accurate, it’s a “good” model because it’s above the baseline. If we build one that is less than 50% accurate (for example by random guessing), then we should discard that model because we can do better with Zero Rule.
Note that Zero Rule only uses the class feature to figure out which class it always predicts. It doesn’t use any of the information in the rest of the dataset. This quirk of Zero Rule also gives it the name of the “no-information model” or “no-information classifier”. No matter what we call it, it’s still the baseline.
You may rightly wonder what we would do with Zero Rule when there’s not a majority class. For instance, in the Iris dataset there are 50 instances of each of the three classes. It’s a tie! In general in data mining, we don’t want to break ties with a fixed rule (such as always taking the first class listed) because that may bias the model toward one class or another. So as a result we would make a random choice of one of the three classes and designate that randomly-chosen class as the one 0R always predicts.
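Here is a minimal sketch of Zero Rule as described in this section, including the random tie-break. The function names `zero_rule` and `baseline_accuracy` are invented for this example, and this is not Weka’s ZeroR implementation; it only follows the description in the text.

```python
import random
from collections import Counter


def zero_rule(classes, rng=None):
    """Return the class Zero Rule always predicts: the most frequent one.

    If several classes tie for most frequent, pick one of them at random.
    """
    if rng is None:
        rng = random.Random()
    counts = Counter(classes)
    top = max(counts.values())
    tied = sorted(c for c, n in counts.items() if n == top)
    return rng.choice(tied)


def baseline_accuracy(classes):
    """Fraction of instances that belong to the most frequent class."""
    counts = Counter(classes)
    return max(counts.values()) / len(classes)


lenses = ["none", "soft", "none", "hard", "none",
          "soft", "none", "hard", "none", "soft"]
print(zero_rule(lenses))          # none
print(baseline_accuracy(lenses))  # 0.5

# Iris has 50 instances of each class, so the prediction is a random pick.
iris = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
print(zero_rule(iris, random.Random(0)))
print(baseline_accuracy(iris))
```

Whichever class wins the tie for Iris, the baseline accuracy is the same: 50 of 150 instances, or about 33%.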
Finally, let’s relate Zero Rule to class probabilities. We’ve seen how to estimate class probabilities from a dataset before, so let’s do that. For each class, we count how many instances there are in that class and divide by the total number of instances to get the fraction of the dataset that belongs to that class. This fraction is our estimate of the class probability.
| Class | Number of Instances | Estimated probability | 
| hard | 2 | 2/10 or 0.2 or 20% | 
| soft | 3 | 3/10 or 0.3 or 30% | 
| none | 5 | 5/10 or 0.5 or 50% | 
Notice that the majority class is the most probable class. The baseline accuracy is the estimated probability of that class. This gives us a more mathematical or formal way to describe 0R. We always predict the most probable class, randomly breaking the tie if two or more classes are equally probable. When accuracy is our chosen metric, the baseline accuracy of our classification models is the probability or likelihood of the majority class.
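The class probability table can be reproduced with a short calculation. In this sketch the function name `class_probabilities` is an assumption, and exact fractions are used so the arithmetic matches the table (Python reduces them, so 2/10 prints as 1/5 and 5/10 as 1/2).

```python
from collections import Counter
from fractions import Fraction


def class_probabilities(classes):
    """Estimate each class probability as count / total number of instances."""
    counts = Counter(classes)
    total = len(classes)
    return {c: Fraction(n, total) for c, n in counts.items()}


lenses = ["none", "soft", "none", "hard", "none",
          "soft", "none", "hard", "none", "soft"]
probs = class_probabilities(lenses)

# Print the classes from least to most probable.
for label, p in sorted(probs.items(), key=lambda item: item[1]):
    print(label, p, float(p))

# The baseline accuracy is the largest estimated class probability.
baseline = max(probs.values())
print(float(baseline))  # 0.5
```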