Predictive Analytics: Evaluating Model Performance
Best guess with no model
First of all, we need to understand the goal of our evaluation. Are we trying to pick the best model? Are we trying to quantify the improvement of each model? Regardless of our goal, I have found it is always useful to think about what the baseline should be. Usually the baseline is your best guess if you don't have a model.
For a classification problem, one approach is to make a random guess (with uniform probability), but a better approach is to guess the output class that has the largest proportion in the training samples. For a regression problem, the best guess is the mean of the output of the training samples.
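The two baselines above can be sketched in a few lines of R. This is only an illustration: the helper names `baseline_class` and `baseline_mean` and the toy data are made up here and are not part of any package.

```r
# Hypothetical helpers (not from any library) for the "no model" baseline.

# Classification: always predict the most frequent class in the training labels.
baseline_class <- function(train_labels) {
  counts <- table(train_labels)
  names(counts)[which.max(counts)]
}

# Regression: always predict the mean of the training outputs.
baseline_mean <- function(train_outputs) {
  mean(train_outputs)
}

# Toy data, made up for illustration
labels  <- c("spam", "ham", "ham", "ham", "spam")
outputs <- c(10, 20, 30, 40)

majority          <- baseline_class(labels)     # "ham"
baseline_accuracy <- mean(labels == majority)   # 0.6
mean_guess        <- baseline_mean(outputs)     # 25
```

Any model we train should at least beat these numbers; otherwise it is adding nothing over the naive guess.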
Prepare data for evaluation
In a typical setting, the data set is divided into 3 disjoint groups. 20% of the data is set aside as testing data to evaluate the model we've trained. The remaining 80% is divided into k partitions. In each round, k-1 partitions are used as training data to train a model with a particular parameter value setting, and 1 partition is used as cross-validation data. We then pick the parameter value that minimizes the error on the cross-validation data.
As a concrete example, let's say we have 100 records available. We'll set aside 20%, which is 20 records, for testing purposes and use the remaining 80 records to train the model. Let's say the model has some tunable parameters (e.g. the number of neighbors in KNN, λ in linear model regularization). For each particular parameter value, we'll conduct 10 rounds of training (i.e. k = 10 partitions of 8 records each). Within each round, we hold out a different partition, train a model on the other 90%, which is 72 records, and compute the prediction error against the 8 held-out records. Then we take the average error of these 10 rounds and pick the parameter value that gives the minimal average error. After picking the optimal tuning parameter, we retrain the model using the whole 80 records.
To evaluate the predictive performance of the model, we'll test it against the 20 testing records we set aside at the beginning. The details will be described below.
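Here is a minimal sketch of the whole procedure in base R, under some assumptions: the data is synthetic, the model is a polynomial regression fitted with `lm`, and the polynomial degree stands in for the tunable parameter. The helper names `rmse` and `cv_error` are made up for this illustration.

```r
set.seed(42)

# Synthetic data, made up for illustration: 100 records
n <- 100
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.3)
data <- data.frame(x = x, y = y)

# Set aside 20% (20 records) as testing data
test_idx <- sample(n, 20)
test  <- data[test_idx, ]
train <- data[-test_idx, ]

# Assign each of the 80 training records to one of 10 partitions (8 records each)
folds <- sample(rep(1:10, length.out = nrow(train)))

rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))

# Average cross-validation error over the 10 rounds for one parameter value.
# The polynomial degree plays the role of the tunable parameter (an assumption).
cv_error <- function(degree) {
  errs <- sapply(1:10, function(f) {
    fit <- lm(y ~ poly(x, degree), data = train[folds != f, ])
    rmse(predict(fit, newdata = train[folds == f, ]), train$y[folds == f])
  })
  mean(errs)
}

degrees     <- 1:6
cv_errors   <- sapply(degrees, cv_error)
best_degree <- degrees[which.min(cv_errors)]

# Retrain on all 80 training records, then evaluate once on the 20 test records
final_model <- lm(y ~ poly(x, best_degree), data = train)
test_rmse   <- rmse(predict(final_model, newdata = test), test$y)
```

The test records are touched only once, at the very end, so `test_rmse` is not influenced by the parameter search.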
Measuring Regression Performance
For a regression problem, the model's performance is quantified by measuring the distance between the estimated output and the actual output. Three measures are commonly used: Root Mean Square Error, Relative Square Error and Coefficient of Determination. Typically, Root Mean Square Error is used for measuring the absolute quantity of accuracy.
Mean Square Error: $MSE = \frac{1}{N} \sum (y_{est} - y_{actual})^2$
Root Mean Square Error: $RMSE = \sqrt{MSE}$
To measure the accuracy with respect to the baseline, we use the ratio of MSE:
Relative Square Error: $RSE = MSE / MSE_{baseline}$
$RSE = \sum (y_{est} - y_{actual})^2 / \sum (y_{mean} - y_{actual})^2$
The Coefficient of Determination (also called R square) measures the proportion of variance that is explained by the model, which is the reduction of variance when using the model instead of the baseline. R square typically ranges from 0 to 1: the model has strong predictive power when it is close to 1 and explains nothing when it is close to 0. (It can drop below 0 on held-out data when the model does worse than the baseline.)
$R^2 = (MSE_{baseline} - MSE) / MSE_{baseline}$
$R^2 = 1 - RSE$
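Here is some R code to compute these measures. To keep the example short, the model is scored against the same data it was trained on.
> # The Prestige data set is available after loading the car package
> # (the original snippet omits this import)
> library(car)
> Prestige_clean <- Prestige[!is.na(Prestige$type),]
> model <- lm(prestige~., data=Prestige_clean)
> score <- predict(model, newdata=Prestige_clean)
> actual <- Prestige_clean$prestige
> rmse <- (mean((score - actual)^2))^0.5
> rmse
[1] 6.780719
> mu <- mean(actual)
> rse <- mean((score - actual)^2) / mean((mu - actual)^2)
> rse
[1] 0.1589543
> rsquare <- 1 - rse
> rsquare
[1] 0.8410457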
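The Mean Square Error penalizes bigger differences more because of the squaring. If we want to reduce the penalty on bigger differences, we can log-transform the numeric quantities first. The log error is commonly computed with a +1 offset so that zero values can be handled.
Mean Square Log Error: $MSLE = \frac{1}{N} \sum (\log(y_{est} + 1) - \log(y_{actual} + 1))^2$
Root Mean Square Log Error: $RMSLE = \sqrt{MSLE}$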
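As a small illustration of the difference, the sketch below compares the two measures on made-up numbers where every prediction is off by a factor of 2. The helper names are hypothetical, and the log error uses `log1p` (that is, log(1 + y)), which is one common convention rather than the only one.

```r
# Hypothetical helpers, written for this illustration only
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))

# log1p(y) = log(1 + y), so zero values are allowed
rmsle <- function(pred, actual) {
  sqrt(mean((log1p(pred) - log1p(actual))^2))
}

# Toy data, made up: every prediction is twice the actual value
actual <- c(10, 100, 1000)
pred   <- c(20, 200, 2000)

rmse_value  <- rmse(pred, actual)    # dominated by the largest record
rmsle_value <- rmsle(pred, actual)   # each record contributes roughly log(2)
```

With the plain RMSE, the record with the largest magnitude decides almost the whole score. With the log version, all three records count about equally because the relative error is the same.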
Measuring Classification Performance
For a binary classification problem, several measures are built from the following four counts.
- TP = Predict +ve when Actual +ve
- TN = Predict -ve when Actual -ve
- FP = Predict +ve when Actual -ve
- FN = Predict -ve when Actual +ve
Precision = TP / Predict +ve = TP / (TP + FP)
Recall or Sensitivity = TP / Actual +ve = TP / (TP + FN)
Specificity = TN / Actual -ve = TN / (FP + TN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy alone is not sufficient to represent the quality of prediction because the cost of making a FP may be different from the cost of making a FN. The F measure combines precision and recall into a single score using a tunable weight α (α = 0.5 gives the familiar F1 score), and is commonly used to measure the quality of a classification model.
1/Fmeasure = α/recall + (1-α)/precision
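The measures above can be computed directly from predicted and actual labels. The following is a minimal sketch on made-up labels; `f_measure` is a hypothetical helper that implements the weighted formula, with α as the weight on recall.

```r
# Toy labels, made up for illustration (TRUE = +ve, FALSE = -ve)
actual    <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
predicted <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)

tp <- sum(predicted & actual)      # 3
tn <- sum(!predicted & !actual)    # 5
fp <- sum(predicted & !actual)     # 1
fn <- sum(!predicted & actual)     # 1

precision   <- tp / (tp + fp)                     # 0.75
recall      <- tp / (tp + fn)                     # 0.75
specificity <- tn / (fp + tn)                     # 5/6
accuracy    <- (tp + tn) / (tp + tn + fp + fn)    # 0.8

# Hypothetical helper: 1/F = alpha/recall + (1 - alpha)/precision
f_measure <- function(precision, recall, alpha = 0.5) {
  1 / (alpha / recall + (1 - alpha) / precision)
}

f1 <- f_measure(precision, recall)                # 0.75
```

Raising α towards 1 makes the score follow recall more closely, while lowering it towards 0 makes it follow precision.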
Notice that most classification models are based on estimating a numeric score for each output class. By choosing the cutoff point of this score, we can control the tradeoff between precision and recall. We can plot the relationship between precision and recall at various cutoff points as follows.
> library(ROCR)
> library(e1071)
> # iristrain and iristest are not defined in this post;
> # a random 100/50 split of iris is assumed here
> idx <- sample(nrow(iris), 100)
> iristrain <- iris[idx,]
> iristest <- iris[-idx,]
> nb_model <- naiveBayes(Species~., data=iristrain)
> nb_prediction <- predict(nb_model, iristest[,-5], type='raw')
> score <- nb_prediction[, c("virginica")]
> actual_class <- iristest$Species == 'virginica'
> # Mix some noise to the score
> # Make the score less precise for illustration
> score <- (score + runif(length(score))) / 2
> pred <- prediction(score, actual_class)
> perf <- performance(pred, "prec", "rec")
> plot(perf)
Another common plot is the ROC curve, which plots the "sensitivity" (true positive rate) against 1 - "specificity" (false positive rate) at varying cutoff points. The area under the curve ("auc") summarizes the curve in a single number and is used to compare the quality of different models. Here is how we produce the ROC curve.
> library(ROCR)
> library(e1071)
> # iristrain and iristest are the training and testing splits of iris
> nb_model <- naiveBayes(Species~., data=iristrain)
> nb_prediction <- predict(nb_model, iristest[,-5], type='raw')
> score <- nb_prediction[, c("virginica")]
> actual_class <- iristest$Species == 'virginica'
> # Mix some noise to the score
> # Make the score less precise for illustration
> score <- (score + runif(length(score))) / 2
> pred <- prediction(score, actual_class)
> perf <- performance(pred, "tpr", "fpr")
> auc <- performance(pred, "auc")
> auc <- unlist(slot(auc, "y.values"))
> plot(perf)
> legend(0.6,0.3,c(c(paste('AUC is', auc)),"\n"), border="white",cex=1.0, box.col = "white")
We can also assign relative costs to false +ve and false -ve decisions to find the best cutoff threshold. Here is how we plot the cost curve.
> # Plot the cost curve to find the best cutoff
> # Assign the cost for False +ve and False -ve
> perf <- performance(pred, "cost", cost.fp=4, cost.fn=1)
> plot(perf)
From the curve, the best cutoff point is 0.6, which is where the cost is minimal.
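To make the idea of a cost-minimizing cutoff concrete, here is a minimal base R sketch that does the search by hand on made-up scores. It is not a reproduction of what ROCR computes internally; the helper `cost_at` and the toy data are assumptions for illustration, with the same 4:1 cost ratio as above.

```r
# Toy scores and labels, made up for illustration
score  <- c(0.95, 0.85, 0.70, 0.65, 0.55, 0.45, 0.30, 0.20)
actual <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)

cost_fp <- 4   # cost of one false +ve
cost_fn <- 1   # cost of one false -ve

# Hypothetical helper: average cost per record when we
# predict +ve for every record with score >= cutoff
cost_at <- function(cutoff) {
  predicted <- score >= cutoff
  fp <- sum(predicted & !actual)
  fn <- sum(!predicted & actual)
  (cost_fp * fp + cost_fn * fn) / length(score)
}

# Try every observed score as a candidate cutoff
cutoffs     <- sort(unique(score))
costs       <- sapply(cutoffs, cost_at)
best_cutoff <- cutoffs[which.min(costs)]   # 0.85
min_cost    <- min(costs)                  # 0.25
```

Because a false +ve is four times as expensive as a false -ve here, the best cutoff ends up high: it is cheaper to miss two positives than to accept a single negative.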
Source of error: Bias and Variance
In model-based machine learning, we assume that the underlying data follows some mathematical model. During training, we fit the training data to this assumed model and determine the model parameters that give the minimal error.
One source of error is when our assumed model is fundamentally wrong (e.g. the output has a non-linear relationship with the input but we are assuming a linear model). This is known as the high bias problem, in which we use an over-simplified model to represent the underlying data.
Another source of error is when the model parameters fit too specifically to the training data and do not generalize well to the underlying data pattern. This is known as the high variance problem, and it usually happens when there is insufficient training data compared to the number of model parameters.
A high bias problem has the symptom that both training and cross-validation show a high error rate, and both error rates drop as the model complexity increases. While the training error keeps decreasing as the model complexity increases, the cross-validation error will start to increase beyond a certain model complexity, and this indicates the beginning of a high variance problem. When the size of the training data is fixed and the only thing we can choose is the model complexity, we should choose the model complexity at the point where the cross-validation error is minimal.
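The behaviour described above can be observed with a small experiment. The sketch below is an illustration under assumptions: the data is synthetic, the polynomial degree of an `lm` fit stands in for model complexity, and a single hold-out set is used as the cross-validation data to keep the code short.

```r
set.seed(1)

# Synthetic data, made up for illustration
n <- 60
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.3)
data <- data.frame(x = x, y = y)

train <- data[1:40, ]
cv    <- data[41:60, ]

rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))

# Fit one model per complexity level and record both errors
degrees <- 1:10
errors <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(train = rmse(predict(fit, newdata = train), train$y),
    cv    = rmse(predict(fit, newdata = cv), cv$y))
}))

# Choose the complexity where the cross-validation error is minimal
best_degree <- degrees[which.min(errors[, "cv"])]

# Plot both curves when running interactively
if (interactive()) {
  matplot(degrees, errors, type = "b", pch = 1:2, lty = 1:2, col = 1:2,
          xlab = "Model complexity (polynomial degree)", ylab = "RMSE")
  legend("top", legend = colnames(errors), pch = 1:2, lty = 1:2, col = 1:2)
}
```

The training error can only go down as the degree grows, since each model contains the previous one as a special case. The cross-validation error has no such guarantee, which is why it is the one we use to choose the complexity.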
When collecting more data costs both time and money, we need to carefully assess the situation before we spend the effort to do so. There is a very pragmatic technique suggested by Andrew Ng from Stanford: plot the error against the size of the data. In this approach, we sample training sets of different sizes, train a model on each, and plot both the cross-validation error and the training error with respect to the training sample size.
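A minimal sketch of this technique in base R follows. The data is synthetic, the model is a deliberately simple straight-line `lm` fit, and the names `pool`, `sizes` and `learning` are made up for this illustration.

```r
set.seed(7)

# Synthetic data, made up for illustration
n <- 300
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.3)
data <- data.frame(x = x, y = y)

cv   <- data[1:100, ]     # fixed cross-validation set
pool <- data[101:300, ]   # records available for training

rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))

# Train on increasingly large samples and record both errors
sizes <- c(10, 20, 40, 80, 160, 200)
learning <- t(sapply(sizes, function(m) {
  train_m <- pool[1:m, ]
  fit <- lm(y ~ x, data = train_m)   # straight line: an over-simplified model
  c(train = rmse(predict(fit, newdata = train_m), train_m$y),
    cv    = rmse(predict(fit, newdata = cv), cv$y))
}))

# Plot both curves when running interactively
if (interactive()) {
  matplot(sizes, learning, type = "b", pch = 1:2, lty = 1:2, col = 1:2,
          xlab = "Training sample size", ylab = "RMSE")
  legend("topright", legend = colnames(learning), pch = 1:2, lty = 1:2, col = 1:2)
}
```

Swapping the straight line for a more flexible model (for example `y ~ poly(x, 9)`) lets us compare how the two curves behave for a simple model and for a complex one on the same data.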
Will getting more data help?
If the problem is a high-bias problem, the training error and the cross-validation error typically converge to a similar, high value as the training size grows.
In this case, collecting more data would not help. We should spend our effort on the following instead.
- Add more input features by talking to domain experts to identify more input signals and collect them.
- Add more input features by combining existing input features in a non-linear way.
- Try more complex machine learning models (e.g. increase the number of hidden layers in a neural network, or increase the number of neurons at each layer).
The only situation where having more data will be helpful is a high-variance problem in which the underlying data model is in fact complex. There we cannot just reduce the model complexity, as this would immediately result in a high-bias problem. In this case, the training error typically stays low while the cross-validation error is much higher, and the gap between them narrows as the training size grows.
Here the only way forward is to collect more training data so that overfitting is less likely to happen.
Evaluating the performance of a model is very important in the overall cycle of predictive analytics. Hopefully this introductory post gives a basic idea of how this can be done.