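While using random forests (RF) recently, I ran into an obvious problem: the class labels in my training set were imbalanced, with many samples for some classes and very few for others. As a result, predictions leaned toward the classes that were in the majority in the training set. The underlying reason is that RF is very sensitive to class imbalance in the training set.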

With the randomForest package in R, you can try the following:

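library(randomForest)
# Assumes Y is a factor with exactly two classes; k is half the minority-class count
k = floor(min(table(Y)) / 2)
rf = randomForest(X, Y, sampsize = c(k, k), replace = FALSE)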

With this approach, the results on my data were good.
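The two-class call above can be extended to any number of classes by giving sampsize one entry per class. The following is only a minimal sketch under stated assumptions: the data set (X, Y) is synthetic and invented for illustration, and the fit is skipped when the randomForest package is not installed.

```r
# Synthetic, imbalanced three-class data (invented for illustration only)
set.seed(42)
Y <- factor(rep(c("a", "b", "c"), times = c(200, 40, 20)))
X <- data.frame(x1 = rnorm(length(Y)) + as.integer(Y),
                x2 = rnorm(length(Y)))

# Half the size of the rarest class, rounded down to a whole number
k <- floor(min(table(Y)) / 2)

# One entry per class, so each tree is grown on k samples from every class
sampsize <- rep(k, nlevels(Y))

# Fit only if the randomForest package is available
if (requireNamespace("randomForest", quietly = TRUE)) {
  rf <- randomForest::randomForest(X, Y, strata = Y, sampsize = sampsize,
                                   replace = FALSE)
  print(rf$confusion)  # out-of-bag confusion matrix with per-class error
}
```

Whether half the minority-class count is the right value of k depends on the data; it is the choice used in this post, not a general rule.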

sampsize: Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.

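This is one way to deal with imbalanced data in RF: each tree is trained on a balanced sample even though the full data set is not balanced. It works because every tree is grown on its own resampled subset of the data (a bootstrap sample, or a subsample drawn without replacement when replace=FALSE).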
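To see what a balanced per-tree sample looks like, here is a small base R sketch. It only mimics the idea of stratified sampling and is not the internal code of the randomForest package; the labels and the helper draw_balanced are hypothetical names made up for this illustration.

```r
# Illustrative sketch only: imitates stratified per-tree sampling in base R
set.seed(1)
Y <- factor(rep(c("major", "minor"), times = c(900, 100)))  # invented labels
k <- floor(min(table(Y)) / 2)

# Hypothetical helper: draw k row indices from each class without replacement
draw_balanced <- function(y, k) {
  per_class <- split(seq_along(y), y)
  unlist(lapply(per_class, function(idx) idx[sample.int(length(idx), k)]),
         use.names = FALSE)
}

idx <- draw_balanced(Y, k)  # rows that one tree would be trained on
print(table(Y))             # full data: 900 "major" vs 100 "minor"
print(table(Y[idx]))        # one tree's sample: k rows from each class
```

Each tree gets a different draw, so over many trees most of the majority-class rows still get used, while no single tree is dominated by the majority class.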

Reference: https://www.quora.com/How-should-I-handle-unbalanced-data-while-using-randomForest-in-R

####################################################################

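#All rights reserved. Please notify the author before reposting. Copyright belongs to the author; any infringement, once discovered, will be pursued under the law.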

#Author: Jason

#####################################################################