Data exploration using Random Forests

The StumbleUpon web page classification competition on Kaggle ended recently. With some luck, I got into the final top 10%. During the initial data exploration, I tried to derive a set of linguistic features from the text, such as the ratios of nouns, adjectives, and adverbs in the web pages. I also suspected that subjectivity might be important, so I added the ratios of positive and negative words to the feature set as well.

To see whether these linguistic features were useful, I plugged the data into a Random Forest. A Random Forest is a collection of decision trees: each tree is trained on a bootstrapped sample of the original data and grown using a random subset of the input variables. Although it is a black-box model, it is probably one of the best off-the-shelf classifiers, offering good accuracy with virtually no parameter tuning. In addition, because each tree is trained using different variables, a variable importance measure comes as a byproduct [1].
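The setup can be sketched in a few lines of scikit-learn. The feature names and the random data below are stand-ins for the real linguistic features, which I am not reproducing here:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
# Hypothetical linguistic features; random data stands in for the real values
feature_names = ["noun_ratio", "adj_ratio", "adv_ratio",
                 "pos_word_ratio", "neg_word_ratio"]
X = rng.rand(500, len(feature_names))   # placeholder feature matrix
y = rng.randint(0, 2, 500)              # placeholder evergreen/ephemeral labels

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)

# feature_importances_ comes for free once the forest is trained
for name, score in sorted(zip(feature_names, clf.feature_importances_),
                          key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```

The importances are normalized to sum to 1, so they can be read directly as relative rankings.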

Feature importances measured by Random Forests

The estimated accuracy of the model is only 62%, which suggests that the linguistic variables are not that useful. However, the picture above still surprised me. It shows that positive and negative word usage is not useful, so my hypothesis was completely wrong. On the other hand, the ratio of nouns ranked top of the feature list, which was something I hadn't thought of.

[1] Feature importances with forests of trees, scikit-learn 0.14 documentation

NoSQL Distilled: A Short Review


The NoSQL movement started a few years ago when Google and Amazon published papers (BigTable and Dynamo) about their efforts to handle extraordinarily large amounts of data. Since then, numerous organizations have begun to explore alternative data storage models that differ from the traditional relational database model, for better scalability or other requirements.

This book, written by Pramod Sadalage and Martin Fowler, is an introduction to the current landscape of alternative, non-relational databases. It is a quick read at only about 150 pages.

The text covers four main NoSQL families: Key-Value Databases, Document Databases, Column-Family Stores and Graph Databases. The first half of the text is great. The differences among these four models and the concepts common to these data stores, as well as sharding, replication and consistency issues, are well explained.

The second half of the book covers “implementation”, with one chapter for each NoSQL family. This part is rather superficial and not really helpful.

Anyway, the book is a good starting point for alternative database systems. Although document databases are becoming popular in the web community and HBase is getting more attention from those who need petabytes of storage, more research is needed before committing to one particular storage technology. For example, Cassandra looks good from many perspectives, but this article [1] points out that an unrecoverable doomsday scenario can occur if the local clock of any one of the cluster nodes goes wrong.

[1] The trouble with timestamps by Aphyr. A recommended great read.

Predicting the Random Walk

A stock price prediction competition on Kaggle ended recently. Although I didn't join it, the result is rather interesting, and EMH (Efficient Market Hypothesis) believers should be happy. The 427 teams that participated, which probably included some of the best data scientists in the world, could not do much better than the Random Walk benchmark. The winning solution is only negligibly better than using the last observable value as the prediction (a 0.0036 improvement in mean absolute error).

The data set contains data collected by Deltix over a two-year period, including 200 days of training data and 310 days of testing data. For each trading day, the prices of 198 securities and the values of 244 undisclosed “indicators” are recorded at 5-minute intervals from 9:30AM to 1:55PM. The task is to predict the closing price of each security 2 hours later, at 4PM.
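The Random Walk benchmark is as simple as it sounds: for each security, predict that the 4PM close equals the last observed (1:55PM) price, and score with mean absolute error. A minimal sketch on simulated random-walk prices (the dimensions mirror the competition data, the prices themselves are made up):

```python
import numpy as np

rng = np.random.RandomState(42)
# Simulated intra-day prices: 198 securities, snapshots up to 1:55PM
prices = np.cumsum(rng.randn(198, 54), axis=1) + 100.0
# Hypothetical 4PM closes: last snapshot plus further random movement
closing = prices[:, -1] + rng.randn(198)

# Random Walk benchmark: predict the last observed price for every security
pred = prices[:, -1]
mae = np.mean(np.abs(pred - closing))
print(f"benchmark MAE: {mae:.4f}")
```

On a true random walk, no model can systematically beat this baseline, which is why a 0.0036 improvement is so telling.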

It is not quite right to say the competition result confirmed the EMH, but given the data provided, it is very hard to make predictions. The features in the data are not disclosed: you don't know what indicator I230 actually means, so you can't use your prior knowledge to build the model. The indicators are also not security-specific, so things like “special news related to stock 1234 announced” are not included in the data set.

In addition, the snapshots of the indicator values also end at 1:55PM, so you cannot use the indicator values at 4PM to aid your prediction. One can imagine that the change in the market index between 1:55PM and 4:00PM could provide some value in making the prediction.

Nevertheless, the biggest lesson learnt in this competition has nothing to do with finance. “Believe your cross-validation score,” as someone said. Many teams relied too heavily on the “Public Leaderboard” score (calculated using 30% of the test data) when optimizing or choosing their models. These over-fitted models performed very badly on the final evaluation at the end of the competition.
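The remedy is the standard one: estimate generalization performance from k-fold cross-validation on your own training data instead of a single public holdout. A minimal sketch with a placeholder model and random data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(300, 10)           # placeholder features
y = rng.randint(0, 2, 300)      # placeholder labels

# 5-fold cross-validation: five train/validate splits, five scores
model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

The spread across folds also warns you how noisy a single 30% holdout score can be.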

Accuracy over Interpretability

“Essentially, all models are wrong, but some are useful.” – George E. P. Box

One reason factor models are popular is that they are easy to interpret. A simple equation like y = Ax + B clearly illustrates the relation between y and x.

Having played around with machine learning for a while, I quite often feel that I have to trade interpretability for accuracy.

Let’s start with ensemble methods. Combining multiple different models into a better one is a very common technique for boosting accuracy and lowering the error rate. For example, I can create a few different regression classifiers and combine them into a meta-classifier using majority voting. Although each single regression classifier is easy to understand, the combined meta-classifier is not: the relation between the predicted outcome and the input variables is hard to see.
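A majority-voting meta-classifier takes only a few lines with scikit-learn's VotingClassifier; the synthetic data and the particular base classifiers here are just illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data stands in for a real problem
X, y = make_classification(n_samples=400, random_state=0)

# Hard voting: each base classifier votes, the majority label wins
meta = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(random_state=0)),
                ("dt", DecisionTreeClassifier(random_state=0))],
    voting="hard")
meta.fit(X, y)
pred = meta.predict(X)
```

Each base model on its own is inspectable; the voted prediction is the part that resists a simple closed-form description.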

Combining a Random Forest model and a Logistic Regression Model

This is the model that I used to solve a text classification problem. The figure shows the average ROC AUC scores when combining a Random Forest model and a Logistic Regression model under different weights. The combined model performs better than either single model alone.
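The weighted combination itself is just a convex blend of the two models' predicted probabilities, scored at each weight. A sketch of that scan, using synthetic data in place of the original text features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the original text classification features
X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p_rf = rf.predict_proba(X_te)[:, 1]
p_lr = lr.predict_proba(X_te)[:, 1]

# Scan the mixing weight w: blended score = w * p_rf + (1 - w) * p_lr
weights = np.linspace(0, 1, 6)
aucs = [roc_auc_score(y_te, w * p_rf + (1 - w) * p_lr) for w in weights]
for w, auc in zip(weights, aucs):
    print(f"w={w:.1f}  AUC={auc:.3f}")
```

The endpoints w=0 and w=1 are the single models, so the scan shows directly whether any blend beats either one alone.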

Quite a number of machine learning models are “black boxes”. One of my favorites is Random Forest. It is a rather plug-and-play model, insensitive to parameter tuning: simply plug in the data and you get a predictor with rather good accuracy. More importantly, it can tell you the estimated accuracy as well as the importance of each input variable. These neat features make it an excellent tool for exploring data.

While the model is very easy to use, it is difficult to understand its prediction mechanism. A Random Forest is itself a combination of hundreds of randomly trained decision trees: when making a prediction, each tree casts a vote and the majority wins. This is something you cannot describe with a simple equation like those regression models.

Given the computational power available these days, the cost of combining multiple prediction models keeps falling. A simple combination of models may easily outperform a single elegantly derived mathematical model in terms of accuracy. The winners of the Netflix Prize used ensemble methods heavily [1]. The future trend may be, as the blog Overkill Analytics put it, “quantity over quality, CPU over IQ”.

In the scientific tradition, we believe that the model for making predictions and the model for making descriptions are the same thing. We use the same Newton’s laws to understand nature and to calculate the acceleration of a moving object. However, as “black-box” models do a better and better job at prediction, we may see these two kinds of models split apart: “white-box” models in the classroom and “black-box” models in practice.

Finally, I want to put an interesting paper here as further reading: Statistical Modeling: The Two Cultures [2]. It is somewhat related to the topic and an enjoyable read =]

[2] Breiman, L. Statistical Modeling: The Two Cultures (with discussion). Statistical Science 2001, 16:199-231.