Simple pooling method for variable selection in multiply imputed datasets outperformed complex methods | BMC Medical Research Methodology


Principal results

When developing a prognostic model after multiple imputation (MI) with different types of variables, including categorical variables, it is important to use a global test to decide whether a categorical variable is relevant to the model. In the present study, multiply imputed simulated datasets were used, and four pooling methods (D1, D2, D3 and the median P rule, MPR) were evaluated for selecting categorical, dichotomous and continuous variables in a logistic regression model with a backward selection (BWS) procedure. The selection frequency of the variables, the P-values of the selected variables and the stability of the selected models were compared with the results obtained in the full dataset. The MPR was tested under many different conditions and variations and proved to be an easy-to-apply method that consistently performed better (in terms of selection frequency, P-values and model stability) than the other pooling methods for categorical variables in an MI context. For continuous and dichotomous variables, no consistent differences were found between the four pooling methods.
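As an illustration of why the median P rule (MPR) is easy to apply, the sketch below pools the P-values of a global test across imputed datasets by taking their median and then performs one backward-selection step. The variable names, the P-values and the p-out threshold are hypothetical and serve only to show the mechanics; how the per-imputation P-values are obtained (for example, from a likelihood ratio test in each imputed dataset) is assumed to be handled elsewhere.

```python
from statistics import median


def median_p_rule(p_values):
    """Pool the P-values of one variable across imputed datasets by their median."""
    if not p_values:
        raise ValueError("need at least one P-value")
    return median(p_values)


def backward_step(pooled_p, p_out):
    """Return the variable to drop in one backward-selection step, or None.

    The variable with the largest pooled P-value is dropped if that
    P-value exceeds the p-out threshold.
    """
    worst = max(pooled_p, key=pooled_p.get)
    return worst if pooled_p[worst] > p_out else None


# Hypothetical P-values of a global test in m = 5 imputed datasets
per_imputation_p = {
    "category_var": [0.03, 0.08, 0.04, 0.12, 0.05],
    "noise_var": [0.40, 0.75, 0.22, 0.61, 0.58],
}

pooled = {name: median_p_rule(ps) for name, ps in per_imputation_p.items()}
to_drop = backward_step(pooled, p_out=0.05)
print(pooled, to_drop)
```

In a full backward selection, the models would be refitted in every imputed dataset after each removal and the step repeated until no pooled P-value exceeds p-out.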

Comparison with literature

Eekhout et al. concluded that, to obtain correct and powerful pooled P-values for significance tests of categorical variables with the MPR compared to the D1, D2 and D3 methods, the outcome should be omitted from the imputation model [1]. To obtain a powerful significance test for continuous and dichotomous variables with Rubin's rules (RR) after MI, the MI procedure must include the outcome variable, as shown by Moons et al. [19]. We repeated our simulation study in datasets with sample sizes of n = 500 and n = 2000 and an additional categorical variable with five categories under two conditions: one that included the outcome variable in the imputation model and one that excluded it. We observed no differences in the median P-values of the selected predictor variables or in the stability of the selected models. Only the selection frequency of the predictor variables was slightly higher when the outcome was included in the imputation model, and this held equally for all pooling methods. The larger the datasets, the smaller the differences between the four pooling methods. We therefore conclude that for overall significance tests of categorical variables, the outcome variable can be included in the imputation model.

Heinze et al. and Wallisch et al. stated that variable selection can compromise the stability of a final model, an often overlooked problem of data-driven variable selection [4, 18]. Furthermore, Royston and Sauerbrei stated that the stability of the model must be demonstrated because many different factors influence the stability of the selected models [20, 21]. In our simulation study, we examined the stability of the selected models in multiply imputed datasets by repeating each procedure 500 times. An interesting result is that the MPR pooling method resulted in more stable variable selection than the other pooling methods. This finding was also reflected in the analyses of the real-world NHANES dataset. Austin et al. and Wood et al. stated that variable selection in multiply imputed datasets should be done from the pooled model using RR, which is easy to do for continuous and dichotomous variables but less straightforward for categorical variables [22, 23]. We distinguished between the different types of variables and showed that the MPR method worked as well as RR for continuous and dichotomous variables and better than the D1, D2 and D3 methods for categorical variables. The ease of use of pooling methods depends on their availability in statistical software. Most software packages do not provide these methods in combination with variable selection, and they are therefore beyond the reach of applied researchers. The strength of the MPR is that it can be applied easily in any software package and does not take much time.
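One common way to summarize the stability of variable selection over repeated runs is to count how often each variable is selected and how often the most frequent model occurs. The sketch below illustrates this under assumptions: the function name, the variable names and the four example runs are hypothetical, and the summary measures shown are not necessarily the ones used in the study.

```python
from collections import Counter


def selection_summary(selected_models):
    """Summarize repeated variable-selection runs.

    selected_models: one list of selected variable names per run.
    Returns the inclusion frequency per variable, the most frequently
    selected model and the share of runs that selected that model.
    """
    n_runs = len(selected_models)
    inclusion = Counter(v for model in selected_models for v in set(model))
    model_counts = Counter(frozenset(model) for model in selected_models)
    inclusion_freq = {v: count / n_runs for v, count in inclusion.items()}
    top_model, top_count = model_counts.most_common(1)[0]
    return inclusion_freq, set(top_model), top_count / n_runs


# Hypothetical outcome of four repeated selection runs
runs = [
    ["age", "bmi"],
    ["age", "bmi"],
    ["age"],
    ["age", "bmi", "noise"],
]
inclusion_freq, top_model, top_share = selection_summary(runs)
print(inclusion_freq, top_model, top_share)
```

A pooling method that yields a higher share for the most frequent model, and inclusion frequencies closer to 0 or 1, would be considered more stable under this summary.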

Strengths and limitations

Our objective was to compare four different selection methods. A strength is that we applied two different ways to pool and select variables: (1) Rubin's rules (RR) were applied to pool continuous and dichotomous variables, and the D1, D2, D3 and MPR pooling methods were applied to categorical variables; (2) all variables were pooled using the D1, D2, D3 and MPR methods.

No difference was found between these two ways of pooling and selecting variables: in both, the MPR outperformed all other methods. Another strength is that we used various p-out values to assess the behavior of the pooling methods when the selected models contained variables with a strong or less strong relationship with the outcome, as is found in everyday practice. We found that in most scenarios, the MPR method resulted in the most stable models.

A further strength is that, in addition to the study by White and Austin et al., we evaluated many different simulated conditions based on empirical data. We assessed the selection frequency of the variables, the P-values of the selected variables and the stability of the selected models [22, 23]. Additionally, we added a noise variable to assess whether all methods handled this variable well. In most of these conditions, the MPR method was no worse than the other methods. A limitation could be that the simulation study used a smaller number of covariates than is common in practical datasets. However, the NHANES dataset contained a mix of weaker and stronger variables, as in real-world datasets, and the results from the NHANES dataset confirmed what we saw in the simulation study.

Another limitation could be that we only used two different correlation levels in our simulated datasets (0.2 and 0.6). However, to set up our simulation study, we initially used the article by Wood, White and Royston [23] on variable selection methods in multiply imputed datasets, which came closest to the goal of our study. They reported a correlation of 0.62 and defined it as a high correlation value. We therefore used a high correlation of 0.6 in our study. We wanted to compare this high correlation with a lower correlation and used the value of 0.2. We believe that by using these values for the correlation, we were able to test the methods in datasets commonly seen in medical studies, which contain variables with comparably low and high correlations.

Another limitation may be that we used a fast backward selection procedure to select variables from the full datasets [24]. It is known that this may not be the most effective method of selection [24, 25]. An alternative may be to use more advanced methods such as the Least Absolute Shrinkage and Selection Operator (LASSO) [25]. However, the LASSO was developed for situations in which the number of predictors is much higher than the number of subjects. This is not the case in many medical and epidemiological datasets. Another problem with LASSO estimation is its dependence on the scale of the covariates. One solution is to use the internal standardization in LASSO software, which scales the covariates to unit variance before variable selection. The regression coefficients are then transformed back to the original scale. However, it is not yet clear whether this "one-size-fits-all" standardization of variables is the best choice for all modeling purposes. Therefore, using the fast backward selection procedure was the best option to compare the pooled selection methods with a similar selection procedure in the full datasets [4].
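The back-transformation of standardized coefficients mentioned for the LASSO can be written out explicitly. The sketch below assumes a linear predictor fitted on covariates that were centered and scaled to unit variance; the function name and all numbers are hypothetical, and actual LASSO software may handle the intercept differently.

```python
def back_transform(intercept_std, coefs_std, means, sds):
    """Convert coefficients fitted on standardized covariates to the original scale.

    With z_j = (x_j - mean_j) / sd_j, the linear predictor
    a + sum_j b_j * z_j equals a' + sum_j b'_j * x_j, where
    b'_j = b_j / sd_j and a' = a - sum_j b'_j * mean_j.
    """
    coefs = [b / s for b, s in zip(coefs_std, sds)]
    intercept = intercept_std - sum(b * m for b, m in zip(coefs, means))
    return intercept, coefs


# Hypothetical fit: the second coefficient was shrunk to zero
intercept, coefs = back_transform(
    intercept_std=0.5,
    coefs_std=[2.0, 0.0],
    means=[10.0, 3.0],
    sds=[4.0, 1.5],
)
print(intercept, coefs)
```

A coefficient that the LASSO shrinks to exactly zero stays zero after the back-transformation, so the set of selected variables is unaffected; only the magnitude of the nonzero coefficients changes with the scale.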
Another limitation could be that we treated all continuous variables as normally distributed, whereas in practice nonlinear relationships also occur. Further research is therefore needed on the selection of such variables in multiply imputed datasets.

