1. Boruta
2. Variable Importance from Machine Learning Algorithms
3. Lasso Regression
4. Step wise Forward and Backward Selection
5. Relative Importance from Linear Regression
6. Recursive Feature Elimination (RFE)
7. Genetic Algorithm
8. Simulated Annealing
9. Information Value and Weights of Evidence
10. DALEX Package

## Introduction

```r
# Load Packages and prepare dataset
library(TH.data)
library(caret)
data("GlaucomaM", package = "TH.data")
trainData <- GlaucomaM
head(trainData)
```

The GlaucomaM dataset

## 1. Boruta

Boruta is an algorithm that ranks and selects features based on random forests.

Boruta's advantage is that it decides explicitly whether a variable is important and helps select the variables that are statistically significant. In addition, you can tune the strictness of the algorithm by adjusting pValue and maxRuns.

maxRuns is the number of times the algorithm is run. The higher this value, the more selective the algorithm becomes in picking variables. The default is 100.

```r
# install.packages('Boruta')
library(Boruta)
```

Boruta uses the same formula interface as other predictive models: the response variable goes on the left-hand side and the predictors on the right.

The doTrace argument controls how much is printed to the console. The higher the value, the more log messages are printed. It is set to 0 here to save space; you can try setting it to 1 or 2.

```r
# Perform Boruta search
boruta_output <- Boruta(Class ~ ., data = na.omit(trainData), doTrace = 0)
```

```r
names(boruta_output)
#>  [1] "finalDecision" "ImpHistory"    "pValue"        "maxRuns"       "light"
#>  [6] "mcAdj"         "timeTaken"     "roughfixed"    "call"          "impSource"
```
```r
# Get the significant variables, including tentative ones
boruta_signif <- getSelectedAttributes(boruta_output, withTentative = TRUE)
print(boruta_signif)
#>  [1] "as"   "ean"  "abrg" "abrs" "abrn" "abri" "hic"  "mhcg" "mhcn" "mhci"
#> [11] "phcg" "phcn" "phci" "hvc"  "vbss" "vbsn" "vbsi" "vasg" "vass" "vasi"
#> [21] "vbrg" "vbrs" "vbrn" "vbri" "varg" "vart" "vars" "varn" "vari" "mdn"
#> [31] "tmg"  "tmt"  "tms"  "tmn"  "tmi"  "rnf"  "mdic" "emd"
```

```r
# Do a tentative rough fix
roughFixMod <- TentativeRoughFix(boruta_output)
boruta_signif <- getSelectedAttributes(roughFixMod)
print(boruta_signif)
#>  [1] "abrg" "abrs" "abrn" "abri" "hic"  "mhcg" "mhcn" "mhci" "phcg" "phcn"
#> [11] "phci" "hvc"  "vbsn" "vbsi" "vasg" "vbrg" "vbrs" "vbrn" "vbri" "varg"
#> [21] "vart" "vars" "varn" "vari" "tmg"  "tms"  "tmi"  "rnf"  "mdic" "emd"
```

```r
# Variable Importance Scores
imps <- attStats(roughFixMod)
imps2 <- imps[imps$decision != 'Rejected', c('meanImp', 'decision')]
head(imps2[order(-imps2$meanImp), ])  # descending sort
#>        meanImp  decision
#> varg 10.279747 Confirmed
#> vari 10.245936 Confirmed
#> tmi   9.067300 Confirmed
#> vars  8.690654 Confirmed
#> hic   8.324252 Confirmed
#> varn  7.327045 Confirmed
```

```r
# Plot variable importance
plot(boruta_output, cex.axis = .7, las = 2, xlab = "", main = "Variable Importance")
```

Variable Importance Boruta

## 2. Variable Importance from Machine Learning Algorithms

1. Train a specific model with train() from the caret package.

2. Use varImp() to determine the variable importance.

```r
# Train an rpart model and compute variable importance
library(caret)
set.seed(100)
rPartMod <- train(Class ~ ., data = trainData, method = "rpart")
rpartImp <- varImp(rPartMod)
print(rpartImp)
#> rpart variable importance
#>
#>   only 20 most important variables shown (out of 62)
#>
#>      Overall
#> varg  100.00
#> vari   93.19
#> vars   85.20
#> varn   76.86
#> tmi    72.31
#> vbss    0.00
#> eai     0.00
#> tmg     0.00
#> tmt     0.00
#> vbst    0.00
#> vasg    0.00
#> at      0.00
#> abrg    0.00
#> vbsg    0.00
#> eag     0.00
#> phcs    0.00
#> abrs    0.00
#> mdic    0.00
#> abrt    0.00
#> ean     0.00
```

rpart used only 5 of the 62 features, and if you look closely, those 5 variables are among the top 6 selected by Boruta.

```r
# Train an RRF model and compute variable importance
set.seed(100)
rrfMod <- train(Class ~ ., data = trainData, method = "RRF")
rrfImp <- varImp(rrfMod, scale = FALSE)
rrfImp
#> RRF variable importance
#>
#>   only 20 most important variables shown (out of 62)
#>
#>      Overall
#> varg 24.0013
#> vari 18.5349
#> vars  6.0483
#> tmi   3.8699
#> hic   3.3926
#> mhci  3.1856
#> mhcg  3.0383
#> mv    2.1570
#> hvc   2.1357
#> phci  1.8830
#> vasg  1.8570
#> tms   1.5705
#> phcn  1.4475
#> phct  1.4473
#> vass  1.3097
#> tmt   1.2485
#> phcg  1.1992
#> mdn   1.1737
#> tmg   1.0988
#> abrs  0.9537

plot(rrfImp, top = 20, main = 'Variable Importance')
```

Some of the other algorithms available in caret's train() for which varImp() can compute variable importance: ada, AdaBag, AdaBoost.M1, adaboost, bagEarth, bagEarthGCV, bagFDA, bagFDAGCV, bartMachine, blasso, BstLm, bstSm, C5.0, C5.0Cost, C5.0Rules, C5.0Tree, cforest, chaid, ctree, ctree2, cubist, deepboost, earth, enet, evtree, extraTrees, fda, gamboost, gbm_h2o, gbm, gcvEarth, glmnet_h2o, glmnet, glmStepAIC, J48, JRip, lars, lars2, lasso, LMT, LogitBoost, M5, M5Rules, msaenet, nodeHarvest, OneR, ordinalNet, ORFlog, ORFpls, ORFridge, ORFsvm, pam, parRF, PART, penalized, PenalizedLDA, qrf, ranger, Rborist, relaxo, rf, rFerns, rfRules, rotationForest, rotationForestCp, rpart, rpart1SE, rpart2, rpartCost, rpartScore, rqlasso, rqnc, RRF, RRFglobal, sdwd, smda, sparseLDA, spikeslab, wsrf, xgbLinear, xgbTree.

## 3. Lasso Regression

Because LASSO regularization shrinks the coefficients of less important variables exactly to zero, LASSO regression can also be regarded as an effective variable selection technique.

```r
library(glmnet)
trainData <- read.csv('https://raw.githubusercontent.com/selva86/datasets/master/GlaucomaM.csv')

x <- as.matrix(trainData[, -63])  # all X vars
y <- as.double(as.matrix(ifelse(trainData[, 63] == 'normal', 0, 1)))  # Only Class

# Fit the LASSO model (Lasso: Alpha = 1)
set.seed(100)
cv.lasso <- cv.glmnet(x, y, family = 'binomial', alpha = 1, parallel = TRUE,
                      standardize = TRUE, type.measure = 'auc')

# Results
plot(cv.lasso)
```

LASSO variable importance

The x-axis is log(lambda) (glmnet uses the natural log), so an axis value of 2 corresponds to a lambda of exp(2) ≈ 7.4.
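A one-line base-R sanity check of that axis scale (no packages needed):

```r
# cv.glmnet's plot uses the natural log on the x-axis,
# so an axis value of 2 corresponds to lambda = exp(2)
lambda_at_2 <- exp(2)
print(round(lambda_at_2, 2))  # ~7.39
```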

## 5. Relative Importance from Linear Regression

```r
# Bootstrapped relative importance (relaimpo package);
# trainData here is the ozone dataset used in the RFE section below
library(relaimpo)
bootsub <- boot.relimp(ozone_reading ~ Temperature_Sandburg + Humidity +
                         Temperature_ElMonte + Month + pressure_height +
                         Inversion_base_height,
                       data = trainData, b = 1000,
                       type = 'lmg', rank = TRUE, diff = TRUE)

plot(booteval.relimp(bootsub, level = .95))
```

## 6. Recursive Feature Elimination (RFE)

Apart from x and y, rfe() takes two important parameters:

• sizes

• rfeControl

sizes determines the subset sizes of most important variables that rfe should iterate over (the numbers of important variables we want to try). Below, sizes is set to 1 through 5, then 10, 15, and 18.

```r
str(trainData)
#> 'data.frame': 366 obs. of 13 variables:
#>  $ Month                : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ Day_of_month         : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ Day_of_week          : int  4 5 6 7 1 2 3 4 5 6 ...
#>  $ ozone_reading        : num  3 3 3 5 5 6 4 4 6 7 ...
#>  $ pressure_height      : num  5480 5660 5710 5700 5760 5720 5790 5790 5700 5700 ...
#>  $ Wind_speed           : int  8 6 4 3 3 4 6 3 3 3 ...
#>  $ Humidity             : num  20 41 28 37 51 ...
#>  $ Temperature_Sandburg : num  40.5 38 40 45 54 ...
#>  $ Temperature_ElMonte  : num  39.8 46.7 49.5 52.3 45.3 ...
#>  $ Inversion_base_height: num  5000 4109 2693 590 1450 ...
#>  $ Pressure_gradient    : num  -15 -14 -25 -24 25 15 -33 -28 23 -2 ...
#>  $ Inversion_temperature: num  30.6 48 47.7 55 57 ...
#>  $ Visibility           : int  200 300 250 100 60 60 100 250 120 120 ...

set.seed(100)
options(warn = -1)

subsets <- c(1:5, 10, 15, 18)

ctrl <- rfeControl(functions = rfFuncs,
                   method = "repeatedcv",
                   repeats = 5,
                   verbose = FALSE)

lmProfile <- rfe(x = trainData[, c(1:3, 5:13)], y = trainData$ozone_reading,
                 sizes = subsets,
                 rfeControl = ctrl)

lmProfile
#> Recursive feature selection
#>
#> Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
#>
#> Resampling performance over subset size:
#>
#>  Variables  RMSE Rsquared   MAE RMSESD RsquaredSD  MAESD Selected
#>          1 5.222   0.5794 4.008 0.9757    0.15034 0.7879
#>          2 3.971   0.7518 3.067 0.4614    0.07149 0.3276
#>          3 3.944   0.7553 3.054 0.4675    0.06523 0.3708
#>          4 3.924   0.7583 3.026 0.5132    0.06640 0.4163
#>          5 3.880   0.7633 2.950 0.5525    0.07021 0.4334
#>         10 3.751   0.7796 2.853 0.5550    0.06791 0.4457        *
#>         12 3.767   0.7779 2.869 0.5511    0.06664 0.4424
#>
#> The top 5 variables (out of 10):
#>    Temperature_ElMonte, Pressure_gradient, Temperature_Sandburg,
#>    Inversion_temperature, Humidity
```

## 8. Simulated Annealing

safsControl is similar to the other control functions in caret (such as those used for rfe and ga). In addition, it accepts an improve argument, which is the number of iterations the search will wait without improvement before resetting to the previous best subset.

```r
# Define control function
sa_ctrl <- safsControl(functions = rfSA,
                       method = "repeatedcv",
                       repeats = 3,
                       improve = 5)  # n iterations without improvement before a reset

# Simulated Annealing feature selection
set.seed(100)
sa_obj <- safs(x = trainData[, c(1:3, 5:13)],
               y = trainData[, 4],
               safsControl = sa_ctrl)

sa_obj
#> Simulated Annealing Feature Selection
#>
#> 366 samples
#> 12 predictors
#>
#> Maximum search iterations: 10
#> Restart after 5 iterations without improvement (0.2 restarts on average)
#>
#> Internal performance values: RMSE, Rsquared
#> Subset selection driven to minimize internal RMSE
#>
#> External performance values: RMSE, Rsquared, MAE
#> Best iteration chose by minimizing external RMSE
#> External resampling method: Cross-Validated (10 fold, repeated 3 times)
#>
#> During resampling:
#>  * the top 5 selected variables (out of a possible 12):
#>    Temperature_ElMonte (73.3%), Inversion_temperature (63.3%), Month (60%),
#>    Day_of_week (50%), Inversion_base_height (50%)
#>  * on average, 6 variables were selected (min = 3, max = 8)
#>
#> In the final search using the entire training set:
#>  * 6 features selected at iteration 10 including:
#>    Month, Day_of_month, Day_of_week, Wind_speed, Temperature_ElMonte ...
#>  * external performance at this iteration is
#>
#>    RMSE Rsquared     MAE
#>  4.0574   0.7382  3.0727

# Optimal variables
print(sa_obj$optVariables)
#> [1] "Month"               "Day_of_month"        "Day_of_week"
#> [4] "Wind_speed"          "Temperature_ElMonte" "Visibility"
```

## 9. Information Value and Weights of Evidence

When the response variable Y is binary, the Information Value (IV) can be used to judge the importance of a categorical predictor. It works well in logistic regression and other binary classification models.

Below, using the adult.csv dataset, we try to find which categorical variables are important for predicting whether an individual earns more than $50K.

Run the following code to import the dataset.

```r
library(InformationValue)
inputData <- read.csv("http://rstatistics.net/wp-content/uploads/2015/09/adult.csv")
print(head(inputData))
#>   AGE        WORKCLASS FNLWGT EDUCATION EDUCATIONNUM      MARITALSTATUS
#> 1  39        State-gov  77516 Bachelors           13      Never-married
#> 2  50 Self-emp-not-inc  83311 Bachelors           13 Married-civ-spouse
#> 3  38          Private 215646   HS-grad            9           Divorced
#> 4  53          Private 234721      11th            7 Married-civ-spouse
#> 5  28          Private 338409 Bachelors           13 Married-civ-spouse
#> 6  37          Private 284582   Masters           14 Married-civ-spouse
#>          OCCUPATION  RELATIONSHIP  RACE    SEX CAPITALGAIN CAPITALLOSS
#> 1      Adm-clerical Not-in-family White   Male        2174           0
#> 2   Exec-managerial       Husband White   Male           0           0
#> 3 Handlers-cleaners Not-in-family White   Male           0           0
#> 4 Handlers-cleaners       Husband Black   Male           0           0
#> 5    Prof-specialty          Wife Black Female           0           0
#> 6   Exec-managerial          Wife White Female           0           0
#>   HOURSPERWEEK NATIVECOUNTRY ABOVE50K
#> 1           40 United-States        0
#> 2           13 United-States        0
#> 3           40 United-States        0
#> 4           40 United-States        0
#> 5           40          Cuba        0
#> 6           40 United-States        0
```

Now let's compute the information value of the categorical variables in inputData.

```r
# Categorical variables to compute the Information Value for
cat_vars <- c("WORKCLASS", "EDUCATION", "MARITALSTATUS", "OCCUPATION",
              "RELATIONSHIP", "RACE", "SEX", "NATIVECOUNTRY")  # all categorical variables

# Init output dataframe
df_iv <- data.frame(VARS = cat_vars,
                    IV = numeric(length(cat_vars)),
                    STRENGTH = character(length(cat_vars)),
                    stringsAsFactors = FALSE)

# Get the information value for each variable
for (factor_var in cat_vars) {
  df_iv[df_iv$VARS == factor_var, "IV"] <-
    InformationValue::IV(X = inputData[, factor_var], Y = inputData$ABOVE50K)
  df_iv[df_iv$VARS == factor_var, "STRENGTH"] <-
    attr(InformationValue::IV(X = inputData[, factor_var], Y = inputData$ABOVE50K), "howgood")
}

# Sort descending by IV
df_iv <- df_iv[order(-df_iv$IV), ]
df_iv
```
```
#>            VARS         IV            STRENGTH
#> 5  RELATIONSHIP 1.53560810   Highly Predictive
#> 3 MARITALSTATUS 1.33882907   Highly Predictive
#> 4    OCCUPATION 0.77622839   Highly Predictive
#> 2     EDUCATION 0.74105372   Highly Predictive
#> 7           SEX 0.30328938   Highly Predictive
#> 1     WORKCLASS 0.16338802   Highly Predictive
#> 8 NATIVECOUNTRY 0.07939344 Somewhat Predictive
#> 6          RACE 0.06929987 Somewhat Predictive
```

As a rule of thumb, if a predictor's information value is:

• less than 0.02, the predictor is not useful for modeling

• 0.02 to 0.1, the predictor has only a weak relationship

• 0.1 to 0.3, the predictor has a medium-strength relationship

• 0.3 or higher, the predictor has a strong relationship
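The thresholds above are easy to encode as a small helper; this is just a sketch, and `iv_strength` is our own illustrative function, not part of the InformationValue package:

```r
# Map an information value to the strength labels described above (illustrative helper)
iv_strength <- function(iv) {
  if (iv < 0.02) "Not useful"
  else if (iv < 0.1) "Weak"
  else if (iv < 0.3) "Medium"
  else "Strong"
}

sapply(c(0.01, 0.05, 0.2, 1.5), iv_strength)
#> [1] "Not useful" "Weak"       "Medium"     "Strong"
```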

The information value is built from the weight of evidence (WOE) of each category of the predictor:

IV = Σ (perc good of all goods − perc bad of all bads) × WOE

where the sum runs over the categories of the predictor and WOE = ln(perc good of all goods / perc bad of all bads).
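To make the formula concrete, here is a worked base-R computation of WOE and IV for a single two-category predictor. The good/bad counts are made up for illustration, not taken from adult.csv:

```r
# Illustrative counts of goods (Y = 1) and bads (Y = 0) per category of one predictor
goods <- c(A = 80, B = 20)   # events in each category
bads  <- c(A = 40, B = 60)   # non-events in each category

pct_good <- goods / sum(goods)   # perc good of all goods
pct_bad  <- bads  / sum(bads)    # perc bad of all bads

woe <- log(pct_good / pct_bad)           # weight of evidence per category
iv  <- sum((pct_good - pct_bad) * woe)   # information value of the predictor

round(woe, 4)
#>       A       B
#>  0.6931 -1.0986
round(iv, 4)
#> [1] 0.7167
```

By the rule of thumb above, an IV of about 0.72 would mark this predictor as having a strong relationship with the response.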

## 10. DALEX Package

DALEX variable importance