I’ll get to the model for the NBA data in a minute. First of all, please allow me a small rant. Sometimes the simple things are the best. I have found this to be true with data mining models. Some might ask, why build a linear model when it is so much more sexy to build a neural network or a support vector machine? My reply is to pose a question in return: the simple things are easier to explain, and if there is equivalent performance, why add unneeded complexity?
I really like Data Miner Recipes in STATISTICA Data Miner. It provides the most satisfying automated data mining experience I have found among commercial or free data mining tools. Unfortunately, linear models are not included in Data Miner Recipes. I understand that linear models aren’t appropriate for large data sets, where everything becomes significant, but it would still be nice to have them as an option that could be checked or unchecked according to the situation. If this option existed, it would be real slick to quickly compare the results of a neural network versus a linear model for a small data mining project. To me, the results of a linear model should be the baseline of performance for the non-linear models. If the linear model outperforms these other models, then there is no question which option to select. On the other hand, a non-linear model needs to be a clear winner to justify selecting it over the linear model. I feel justified in making this statement because of the added burden on the data miner to explain what is going on inside the black box of a non-linear model. So there is my rant for the day.
I followed the steps to build a linear model essentially as laid out by Data Miner Recipes. I started by checking for correlations among the input variables and found that TS% and EFF FG% were highly correlated with OFF EFF, so I removed TS% and EFF FG% from the analysis. I did not run feature selection because I intended to use Best Subsets or Backwards Selection for the linear model. The rule of thumb I use is that there must be at least 5 cases to estimate each effect in the model. With 30 NBA teams, that means at most 5 to 6 input variables should be included in the team winning percentage model. Even after removing TS% and EFF FG%, there are not enough cases to estimate all the first order effects for the remaining input variables.
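The correlation screen in that first step can be sketched in a few lines of Python. This is a generic illustration on synthetic data, not STATISTICA's procedure, and the variable names (OFF_EFF, DEF_EFF, TS_PCT) are just stand-ins for the NBA inputs:

```python
import numpy as np

def drop_correlated(X, names, threshold=0.9):
    """Greedily keep a column only if its absolute correlation with
    every already-kept column is at or below the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return [names[j] for j in keep]

# Synthetic example: the third input is nearly a copy of the first,
# mirroring how TS% and EFF FG% tracked OFF EFF in the NBA data.
rng = np.random.default_rng(0)
x0 = rng.normal(size=100)
x1 = rng.normal(size=100)
x2 = x0 + rng.normal(scale=0.01, size=100)  # highly correlated with x0
X = np.column_stack([x0, x1, x2])
kept = drop_correlated(X, ["OFF_EFF", "DEF_EFF", "TS_PCT"])
print(kept)  # → ['OFF_EFF', 'DEF_EFF']
```

A greedy screen like this keeps whichever correlated variable appears first, so in practice you would order the columns to favor the variable you would rather interpret.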
From Advanced Models I selected General Regression and then General Linear Models, which allowed me to select Backwards Stepwise under the options tab. The reduced model turned out to be the same whether I used Backwards Stepwise or Best Subsets. Only OFF EFF and DEF EFF were significant in predicting winning percentage in the reduced model. This makes sense: teams that are the best offensively and defensively should win the most. Overall, this simple model does a pretty good job of predicting winning percentage (Table 1).
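The idea behind backward elimination can be sketched with plain NumPy: fit OLS, drop the weakest predictor, refit, and repeat. This is a rough illustration on synthetic data, using |t| ≥ 2 as an approximate 0.05 cutoff; it is not STATISTICA's exact Backwards Stepwise algorithm, and the input names are hypothetical stand-ins:

```python
import numpy as np

def backward_eliminate(X, y, names, t_min=2.0):
    """Backward elimination on an OLS fit: repeatedly drop the
    predictor with the smallest |t| statistic until every remaining
    |t| clears the rough cutoff t_min."""
    cols = list(range(X.shape[1]))
    while cols:
        A = np.column_stack([np.ones(len(y)), X[:, cols]])  # intercept + kept terms
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        dof = len(y) - A.shape[1]
        s2 = resid @ resid / dof                      # residual variance
        se = np.sqrt(s2 * np.diag(np.linalg.inv(A.T @ A)))
        t = np.abs(beta[1:] / se[1:])                 # skip the intercept
        worst = int(np.argmin(t))
        if t[worst] >= t_min:
            break                                     # everything left is significant
        cols.pop(worst)
    return [names[j] for j in cols]

# Synthetic winning-percentage example with 30 "teams": only the first
# two inputs actually drive y, mirroring OFF EFF and DEF EFF in the post.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = 0.5 + 0.1 * X[:, 0] - 0.1 * X[:, 1] + rng.normal(scale=0.02, size=30)
kept = backward_eliminate(X, y, ["OFF_EFF", "DEF_EFF", "PACE", "AST_RATIO"])
print(kept)
```

On data like this, the two real drivers survive the elimination; whether a spurious input also survives depends on the noise draw, which is exactly why the observed-versus-predicted check below still matters.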
The plot of observed versus predicted values graphically demonstrates the high R^2 value reported in Table 1 (see Figure 1).
For comparison I fit a neural network that achieved a comparable R^2 value, but its observed versus predicted graph seemed to indicate over-fitting. I also ran the feature selection tool after removing the correlated variables, and DEF EFF and OFF EFF had the highest F-values. This was a nice confirmation that the reduced linear model was a good fit.
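The quantity that kind of univariate screen ranks by can be sketched as follows, again on synthetic data with stand-in variable names. The F statistic here is the one from a simple regression of the response on each input, F = r^2 / (1 - r^2) * (n - 2):

```python
import numpy as np

def f_scores(X, y):
    """Univariate F statistic of each input against the response:
    F = r^2 / (1 - r^2) * (n - 2) for the simple regression of y on x."""
    n = len(y)
    scores = []
    for j in range(X.shape[1]):
        r = np.corrcoef(X[:, j], y)[0, 1]
        scores.append(r**2 / (1 - r**2) * (n - 2))
    return np.array(scores)

# Synthetic data where only the first two inputs drive the response,
# echoing how OFF EFF and DEF EFF topped the F-value ranking.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 3))
y = 0.5 + 0.1 * X[:, 0] - 0.1 * X[:, 1] + rng.normal(scale=0.02, size=60)
names = ["OFF_EFF", "DEF_EFF", "PACE"]
ranked = [names[j] for j in np.argsort(f_scores(X, y))[::-1]]
print(ranked)  # the two real drivers should rank first
```

Note that a univariate screen like this looks at one input at a time, so it can miss variables that only matter jointly; here it simply corroborates what the multivariable stepwise fit already found.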
What did you come up with for a model? Do you get a comparable R^2 value? Did you check for over-fitting? Do you agree that linear models are underutilized for data mining or do you feel like my rant is off base?
I would be curious to hear your feedback. Please post your comments or questions below. Have a great week. I hope to post next weekend. I am planning a comparison between the new version of Rapid Miner and STATISTICA using a big data problem.