Random Forest with an Imbalanced Training Set

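While using RF recently, I ran into a very obvious problem: the class labels in the training set are imbalanced, with some classes having many samples and others very few, so predictions lean toward the labels that form the majority of the training set. The underlying reason is that RF is very sensitive to imbalanced classification training sets.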

In R, with the randomForest package, you can try the following:


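library(randomForest)
# Draw the same number of cases from each class for every tree (assumes Y has two classes)
k <- floor(min(table(Y)) / 2)  # half the size of the rarest class, rounded down to a whole number
rf <- randomForest(X, Y, sampsize = c(k, k), replace = FALSE)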

With this approach, the results on my data were quite good.
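The call above hard-codes two classes through c(k, k). A minimal sketch of the same idea for any number of classes follows. The helper name balanced_sampsize and the toy data are assumptions made for illustration; they are not part of randomForest or of my original data.

```r
# Hypothetical helper: one equal sample size per class,
# set to half the size of the rarest class (at least 1).
balanced_sampsize <- function(y) {
  y <- as.factor(y)
  k <- max(1, floor(min(table(y)) / 2))
  rep(k, nlevels(y))
}

# Made-up imbalanced data: 190 "neg" cases and 10 "pos" cases
set.seed(1)
Y <- factor(rep(c("neg", "pos"), times = c(190, 10)))
X <- data.frame(x1 = rnorm(200), x2 = rnorm(200))

ss <- balanced_sampsize(Y)
print(ss)  # 5 5

# Fit only if the randomForest package is installed
if (requireNamespace("randomForest", quietly = TRUE)) {
  rf <- randomForest::randomForest(X, Y, sampsize = ss, replace = FALSE)
  print(rf$confusion)
}
```

The length of the sampsize vector has to match the number of classes in Y, which is why the helper repeats k once per factor level.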

sampsize: Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.

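This is one way to deal with imbalanced data in RF: each tree is grown on a class-balanced sample even though the full data set is not balanced. This is possible because every tree is trained on its own random sample of the training set (a bootstrap sample by default, or a subsample drawn without replacement when replace=FALSE).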
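To make the per-tree sampling concrete, here is a rough base-R sketch of what a stratified draw for a single tree could look like. It is not randomForest's internal code, and the labels are made up for illustration.

```r
# Sketch of a stratified draw for one tree (illustration only)
set.seed(42)
y <- factor(rep(c("neg", "pos"), times = c(190, 10)))  # made-up imbalanced labels
k <- floor(min(table(y)) / 2)                          # 5 cases per class

# For each class, pick k row indices without replacement
idx <- unlist(lapply(levels(y), function(cl) {
  rows <- which(y == cl)
  rows[sample.int(length(rows), k)]
}))

print(table(y))       # full data: 190 neg, 10 pos
print(table(y[idx]))  # one tree's sample: 5 neg, 5 pos
```

Each tree gets a different balanced draw, so across the whole forest most of the majority-class cases are still used, just never all at once.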

Reference: https://www.quora.com/How-should-I-handle-unbalanced-data-while-using-randomForest-in-R

####################################################################

#Author: Jason

#####################################################################
