Multivariable regression models are an important tool in the biological sciences, as they provide a simplified mathematical relation between multiple variables to determine factors affecting an outcome. When building a regression model, especially for large datasets with many covariates, variable selection is a crucial step.
An ideal variable selection method for regression models would find one or more subsets of variables with optimal predictive or explanatory performance. In practice, this performance is usually not optimized during variable selection: exhaustively testing all possible variable subsets is often infeasible, so non-optimal empirical variable selection methods are applied instead.
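To make the scale of an exhaustive search concrete, the short Python sketch below counts the candidate models it would have to fit. The function name `count_subsets` and the example values of `p` are illustrative assumptions, not taken from the study.

```python
from math import comb


def count_subsets(p, max_size=None):
    """Count the non-empty covariate subsets that can be drawn from p candidates.

    If max_size is given, only subsets with at most max_size covariates count.
    """
    if max_size is None:
        return 2 ** p - 1  # every subset except the empty one
    return sum(comb(p, k) for k in range(1, min(max_size, p) + 1))


# The count roughly doubles with every additional covariate.
for p in (10, 20, 40):
    print(p, count_subsets(p))
```

With 10 covariates there are 1,023 candidate models; with 40 there are already more than a trillion, which is why exhaustive search is normally restricted to a small pool of covariates.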
We propose a new algorithm for model selection that combines forward variable selection and all-subsets regression, referred to as "FARMS" ("Forward and All-subsets Regression for Model Selection"). We have implemented FARMS in the R statistical software.
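The internals of FARMS are not described here, so the following Python sketch shows only one plausible way a forward stage and an all-subsets stage could be chained: greedy forward selection narrows the covariates to a small pool, and every subset of that pool is then scored exhaustively. The function names, the use of BIC as the score, the `pool_size` and `forced` parameters, and the simulated data are all assumptions made for illustration; this is not the FARMS implementation, which is written in R.

```python
from itertools import combinations
from math import log
import random


def rss(X, y, cols):
    """Residual sum of squares of an OLS fit of y on an intercept plus columns `cols`.

    Solves the normal equations directly; assumes the chosen columns are not collinear.
    """
    Z = [[1.0] + [row[c] for c in cols] for row in X]
    k = len(cols) + 1
    # Augmented matrix [Z'Z | Z'y].
    A = [[sum(z[a] * z[b] for z in Z) for b in range(k)]
         + [sum(z[a] * yi for z, yi in zip(Z, y))] for a in range(k)]
    # Gaussian elimination with partial pivoting.
    for j in range(k):
        piv = max(range(j, k), key=lambda r: abs(A[r][j]))
        A[j], A[piv] = A[piv], A[j]
        for r in range(j + 1, k):
            f = A[r][j] / A[j][j]
            for c in range(j, k + 1):
                A[r][c] -= f * A[j][c]
    # Back substitution for the coefficients.
    beta = [0.0] * k
    for j in reversed(range(k)):
        tail = sum(A[j][c] * beta[c] for c in range(j + 1, k))
        beta[j] = (A[j][k] - tail) / A[j][j]
    return sum((yi - sum(b * v for b, v in zip(beta, z))) ** 2 for z, yi in zip(Z, y))


def bic(X, y, cols):
    """BIC of a Gaussian linear model, up to an additive constant (needs RSS > 0)."""
    n = len(y)
    return n * log(rss(X, y, cols) / n) + (len(cols) + 1) * log(n)


def forward_then_all_subsets(X, y, pool_size, forced=()):
    """Hypothetical two-stage search: forward screening, then exhaustive scoring."""
    pool = list(forced)
    # Stage 1: greedily add the covariate that lowers RSS the most.
    while len(pool) < pool_size:
        rest = [c for c in range(len(X[0])) if c not in pool]
        if not rest:
            break
        pool.append(min(rest, key=lambda c: rss(X, y, pool + [c])))
    # Stage 2: score every subset of the pool that keeps the forced-in covariates.
    free = [c for c in pool if c not in forced]
    candidates = [list(forced) + list(s)
                  for k in range(len(free) + 1) for s in combinations(free, k)]
    return sorted(min(candidates, key=lambda s: bic(X, y, s)))


# Simulated data: only columns 0 and 3 carry signal; column 5 is forced in.
rng = random.Random(0)
X = [[rng.uniform(-1, 1) for _ in range(6)] for _ in range(40)]
y = [2 * r[0] - 3 * r[3] + 0.1 * rng.uniform(-1, 1) for r in X]
print(forward_then_all_subsets(X, y, pool_size=3, forced=(5,)))
```

In this sketch the exhaustive stage costs at most 2^pool_size fits instead of 2^p, which is the practical appeal of screening first; the trade-off is that a covariate missed by the forward stage can never enter the final model.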
FARMS is a flexible method with additional features that allow the search for the best model to be tailored to the experimental needs. For instance, forced-in covariates can be specified and the total number of covariates included in the final model can be fixed a priori. The best model can be selected by its AIC or BIC value, and the best subset can be selected using different criteria: Mallows’ Cp, R-squared, adjusted R-squared, or the already mentioned AIC and BIC.
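For reference, the criteria named above can all be computed from a model's residual sum of squares. The Python sketch below uses the standard textbook formulas for a Gaussian linear model with an intercept; the function name, argument names, and example numbers are illustrative assumptions, and the AIC and BIC are given only up to an additive constant, so the values can differ from those reported by statistical software while still ranking models fitted to the same data identically.

```python
from math import log


def selection_criteria(rss, tss, n, k, sigma2_full):
    """Selection criteria for a linear model with k covariates plus an intercept.

    rss:         residual sum of squares of the candidate model
    tss:         total sum of squares of the response around its mean
    n:           number of observations
    sigma2_full: residual variance estimate from the full model (for Mallows' Cp)
    """
    p = k + 1  # number of regression parameters, including the intercept
    r2 = 1.0 - rss / tss
    return {
        "r2": r2,
        "adj_r2": 1.0 - (1.0 - r2) * (n - 1) / (n - p),
        "aic": n * log(rss / n) + 2 * p,       # up to an additive constant
        "bic": n * log(rss / n) + p * log(n),  # up to the same constant
        "cp": rss / sigma2_full + 2 * p - n,
    }


# Made-up numbers for a three-covariate model fitted to 50 observations.
print(selection_criteria(rss=20.0, tss=100.0, n=50, k=3, sigma2_full=0.4))
```

Lower AIC, BIC, and Cp indicate a better trade-off between fit and complexity, whereas higher adjusted R-squared is better. Plain R-squared never decreases when covariates are added, so it is only meaningful for comparing subsets of the same size.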
To explore the properties of this new method, including its robustness, we ran several tests varying the FARMS parameters. We also compared its results with those of commonly used methods such as stepwise regression and all-subsets regression.
We performed these comparisons on a real dataset that includes host genetic and immunological information on over 800 HIV-infected individuals from Lima (Peru) and Durban (South Africa). This dataset includes approximately 500 variables with information on HIV immune reactivity (around 400 predictive variables), individual genetic characteristics (around 80 predictive variables) and clinical data such as the plasma viral load.
The results showed that FARMS is a very robust approach: the same model was obtained in all 400 executions with varied parameters. Our approach is also faster than common approaches: when selecting the best model from 400 covariates, FARMS needed less than half the time of stepwise approaches. Further improvements are currently being added to the algorithm, including the evaluation of quadratic terms to capture non-linear relationships among the variables and parameters that allow interaction terms to be evaluated.
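How the planned quadratic and interaction terms will be handled inside FARMS is not specified. As a rough illustration of what such terms look like, the Python sketch below expands one observation with squared terms and pairwise products; the function name `expand_terms` and the labelling scheme are assumptions.

```python
from itertools import combinations


def expand_terms(row, names, quadratic=True, interactions=True):
    """Return (values, labels) for one observation with squared and pairwise product terms added."""
    values, labels = list(row), list(names)
    if quadratic:
        values += [v * v for v in row]
        labels += [f"{n}^2" for n in names]
    if interactions:
        for i, j in combinations(range(len(row)), 2):
            values.append(row[i] * row[j])
            labels.append(f"{names[i]}:{names[j]}")
    return values, labels


values, labels = expand_terms([2.0, 3.0], ["a", "b"])
print(labels)
print(values)
```

The expansion grows quickly: p covariates yield p squared terms and p(p-1)/2 pairwise interactions, so 400 covariates would become 80,600 candidate terms, which makes an efficient selection procedure even more important.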