“Essentially, all models are wrong, but some are useful.” – George E. P. Box

One reason factor models are popular is that they are easy to interpret. A simple equation like y = Ax + B clearly illustrates the relation between y and x.

Having played around with Machine Learning for a while, I quite often feel that I have to trade interpretability for accuracy.

Let’s start with ensemble methods. Combining multiple different models into a better one is a very common technique to boost accuracy and lower the error rate. For example, I can create a few different regression classifiers and combine them into a meta-classifier using majority voting. Although each single regression classifier is easy to understand, the combined meta-classifier is not: the relation between the predicted outcome and the input variables is hard to see.
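A minimal sketch of such a majority-voting meta-classifier using scikit-learn’s `VotingClassifier`; the toy dataset and the choice of base classifiers here are my own illustrative assumptions, not taken from the post:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for a real problem.
X, y = make_classification(n_samples=500, random_state=0)

# Three simple, individually interpretable classifiers...
base_clfs = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(max_depth=3)),
    ("nb", GaussianNB()),
]

# ...combined into one meta-classifier by majority ("hard") voting.
meta = VotingClassifier(estimators=base_clfs, voting="hard")
meta.fit(X, y)
print("training accuracy:", meta.score(X, y))
```

Each base model on its own is easy to explain; the voted combination no longer has a single readable decision rule.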

This is the model I used to solve a text classification problem. The figure shows the average ROC AUC scores when combining a Random Forest model and a Logistic Regression model under different weights. The combined model performs better than either single model alone.
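A hedged sketch of the kind of weighted combination behind that figure: average the predicted probabilities of a Random Forest and a Logistic Regression under a sweep of weights and score each blend with ROC AUC. The data, the weight grid, and the hyperparameters are placeholders, not the post’s actual setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability of the positive class from each model.
p_rf = rf.predict_proba(X_te)[:, 1]
p_lr = lr.predict_proba(X_te)[:, 1]

# Sweep the weight w from 0 (pure Logistic Regression) to 1 (pure Random Forest).
for w in np.linspace(0.0, 1.0, 5):
    auc = roc_auc_score(y_te, w * p_rf + (1 - w) * p_lr)
    print(f"w={w:.2f}  ROC AUC={auc:.3f}")
```

In practice the weight would be chosen by cross-validation rather than on the test set; this only illustrates the mechanics of the blend.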

Quite a number of machine learning models are “black boxes”. One of my favorites is Random Forest. It is a rather “plug-and-play” model and is insensitive to parameter tuning: simply plug in the data and you will get a predictor with rather good accuracy. More importantly, it can tell you the estimated accuracy as well as the importance of each input variable. These neat features make it an excellent tool for exploring the data.
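Those two neat features can be seen directly in scikit-learn: the out-of-bag score gives an estimated accuracy for free, and `feature_importances_` ranks the input variables. The toy dataset below is an assumption for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data: 8 features, only 3 of which are actually informative.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)

# oob_score=True asks the forest to estimate accuracy on the samples
# each tree did not see during bootstrap training.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

print("estimated accuracy (out-of-bag):", rf.oob_score_)
for i, imp in enumerate(rf.feature_importances_):
    print(f"feature {i}: importance {imp:.3f}")
```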

While the model is very easy to use, its prediction mechanism is difficult to understand. A Random Forest is itself a combination of hundreds of randomly trained decision trees. When making a prediction, each tree casts a vote and the majority wins. It is not something you can describe with a simple equation like those regression models.
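The voting can be made visible by querying the trees individually. (Strictly, scikit-learn’s forest averages the trees’ class probabilities rather than taking a hard vote, but a majority vote over the fitted trees, as described above, looks like this; the data is again a toy assumption.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
rf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

sample = X[:1]  # one observation to classify

# Ask every tree in the forest for its individual prediction...
votes = [int(tree.predict(sample)[0]) for tree in rf.estimators_]

# ...and take the majority.
majority = int(np.bincount(votes).argmax())
print(votes, "->", majority)
```

There is no compact y = Ax + B hiding in there; the “equation” is the joint behavior of all 25 trees.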

Given the computational power available these days, the cost of combining multiple prediction models keeps falling. It is not unlikely that a simple combination of models will easily outperform a single, elegantly derived mathematical model in terms of accuracy. The winners of the Netflix Prize used ensemble methods heavily [1]. The future trend may be, as the blog Overkill Analytics put it, “quantity over quality, CPU over IQ”.

In the scientific tradition, we believe that the model used for prediction and the model used for description refer to the same thing: we use the same Newton’s laws both to understand nature and to calculate the acceleration of a moving object. However, as “black-box” models do a better and better job at prediction, we may see these two kinds of models split apart: “white-box” models in the classroom and “black-box” models in practice.

Finally, I want to leave an interesting paper here as further reading: *Statistical Modeling: The Two Cultures* [2]. It is somewhat related to the topic and an enjoyable read =]

[1] http://www.netflixprize.com/

[2] Breiman, L. Statistical Modeling: The Two Cultures (with discussion). Statistical Science 2001, 16:199–231.