Calibration Curves for Prediction Models (Calibration curve)

The x-axis of a calibration curve is the probability predicted by the model, for example the predicted probability that a patient has a tumor (the risk probability). The y-axis is the true probability of the event, i.e., the observed proportion of subjects in whom the event actually occurred.
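What the two axes contain can be sketched in a few lines of base R: bin the predicted probabilities, then compare each bin's mean prediction (x-axis) with its observed event proportion (y-axis). The vectors below are made-up illustration data:

```r
# Toy illustration of what a calibration curve plots (invented data)
pred  <- c(0.10, 0.20, 0.15, 0.40, 0.60, 0.70, 0.80, 0.90)  # predicted risk
label <- c(0,    0,    1,    0,    1,    1,    1,    1)     # observed events

# Two bins: predictions in [0, 0.5] and (0.5, 1]
bins <- cut(pred, breaks = c(0, 0.5, 1), include.lowest = TRUE)

mean_pred <- tapply(pred,  bins, mean)  # x-axis: mean predicted probability
obs_prop  <- tapply(label, bins, mean)  # y-axis: observed event proportion
mean_pred
obs_prop
```

A well-calibrated model yields points close to the diagonal, i.e., mean_pred roughly equal to obs_prop in each bin.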

``````# The rms package can draw a calibration plot
library(rms)
# val.prob(p, y): predicted probabilities first, then the observed 0/1 labels
val.prob(predicted.probability, TRUE.Label)

# rms::calibrate() provides an alternative approach

# I also wrote my own function, drawing on others' code; the result is
# shown in the figure below. Feel free to use it.
library(loonR)
riskCalibrationPlot(TRUE.Label, predicted.probability)
``````

#####################################################################
# All rights reserved. Please notify the author before reposting; the
# copyright belongs to the author and infringement will be pursued.
# Author: Jason
#####################################################################

Feature Selection

1. Boruta
2. Variable Importance from Machine Learning Algorithms
3. Lasso Regression
4. Step wise Forward and Backward Selection
5. Relative Importance from Linear Regression
6. Recursive Feature Elimination (RFE)
7. Genetic Algorithm
8. Simulated Annealing
9. Information Value and Weights of Evidence
10. DALEX Package

Introduction

``````# Load Packages and prepare dataset
library(TH.data)
library(caret)
data("GlaucomaM", package = "TH.data")
trainData <- GlaucomaM
``````

The GlaucomaM (glaucoma) dataset

1. Boruta

Boruta is an algorithm that ranks and selects features based on random forests.

Its advantage is that it makes an explicit decision on whether each variable is important and helps select the statistically significant ones. In addition, the strictness of the algorithm can be tuned via its p-value and maxRuns parameters.

maxRuns is the maximum number of iterations the algorithm runs; the larger the value, the more variables get definitively classified (and typically selected). The default is 100.

``````# install.packages('Boruta')
library(Boruta)
``````

Boruta uses the same formula interface as other predictive models: the response variable goes on the left-hand side and the predictors on the right.

The `doTrace` argument controls how much logging is printed to the console; the higher the value, the more log output. To save space it is set to 0 here, but you can try 1 or 2.

``````# Perform Boruta search
boruta_output <- Boruta(Class ~ ., data=na.omit(trainData), doTrace=0)
``````

``````names(boruta_output)
1. 'finalDecision'
2. 'ImpHistory'
3. 'pValue'
4. 'maxRuns'
5. 'light'
6. 'mcAdj'
7. 'timeTaken'
8. 'roughfixed'
9. 'call'
10. 'impSource'
``````
``````# Get the significant or tentative variables
boruta_signif <- getSelectedAttributes(boruta_output, withTentative = TRUE)
print(boruta_signif)
[1] "as"   "ean"  "abrg" "abrs" "abrn" "abri" "hic"  "mhcg" "mhcn" "mhci"
[11] "phcg" "phcn" "phci" "hvc"  "vbss" "vbsn" "vbsi" "vasg" "vass" "vasi"
[21] "vbrg" "vbrs" "vbrn" "vbri" "varg" "vart" "vars" "varn" "vari" "mdn"
[31] "tmg"  "tmt"  "tms"  "tmn"  "tmi"  "rnf"  "mdic" "emd"
``````

``````# Do a tentative rough fix
roughFixMod <- TentativeRoughFix(boruta_output)
boruta_signif <- getSelectedAttributes(roughFixMod)
print(boruta_signif)
[1] "abrg" "abrs" "abrn" "abri" "hic"  "mhcg" "mhcn" "mhci" "phcg" "phcn"
[11] "phci" "hvc"  "vbsn" "vbsi" "vasg" "vbrg" "vbrs" "vbrn" "vbri" "varg"
[21] "vart" "vars" "varn" "vari" "tmg"  "tms"  "tmi"  "rnf"  "mdic" "emd"
``````

``````# Variable Importance Scores
imps <- attStats(roughFixMod)
imps2 = imps[imps$decision != 'Rejected', c('meanImp', 'decision')]
head(imps2[order(-imps2$meanImp), ])  # sorted by importance, descending
       meanImp  decision
varg 10.279747 Confirmed
vari 10.245936 Confirmed
tmi   9.067300 Confirmed
vars  8.690654 Confirmed
hic   8.324252 Confirmed
varn  7.327045 Confirmed
``````

``````# Plot variable importance
plot(boruta_output, cex.axis=.7, las=2, xlab="", main="Variable Importance")
``````

Figure: Boruta variable importance

2. Variable Importance from Machine Learning Algorithms

1. Train a specific model with `train()` from the caret package.
2. Use `varImp()` to determine the importance of the variables.

``````# Train an rpart model and compute variable importance
library(caret)
set.seed(100)
rPartMod <- train(Class ~ ., data=trainData, method="rpart")
rpartImp <- varImp(rPartMod)
print(rpartImp)
rpart variable importance

only 20 most important variables shown (out of 62)

Overall
varg  100.00
vari   93.19
vars   85.20
varn   76.86
tmi    72.31
vbss    0.00
eai     0.00
tmg     0.00
tmt     0.00
vbst    0.00
vasg    0.00
at      0.00
abrg    0.00
vbsg    0.00
eag     0.00
phcs    0.00
abrs    0.00
mdic    0.00
abrt    0.00
ean     0.00
``````

rpart used only 5 of the 62 features, and if you look closely, those 5 variables are among the top 6 selected by Boruta.
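That overlap can be checked directly; the two vectors below are copied from the rpart and Boruta outputs above:

```r
# Features with nonzero rpart importance, and Boruta's six highest meanImp
rpart_top   <- c("varg", "vari", "vars", "varn", "tmi")
boruta_top6 <- c("varg", "vari", "tmi", "vars", "hic", "varn")

intersect(rpart_top, boruta_top6)  # all five rpart features reappear
```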

``````# Train an RRF model and compute variable importance.
set.seed(100)
rrfMod <- train(Class ~ ., data=trainData, method="RRF")
rrfImp <- varImp(rrfMod, scale=F)
rrfImp
RRF variable importance

only 20 most important variables shown (out of 62)

Overall
varg 24.0013
vari 18.5349
vars  6.0483
tmi   3.8699
hic   3.3926
mhci  3.1856
mhcg  3.0383
mv    2.1570
hvc   2.1357
phci  1.8830
vasg  1.8570
tms   1.5705
phcn  1.4475
phct  1.4473
vass  1.3097
tmt   1.2485
phcg  1.1992
mdn   1.1737
tmg   1.0988
abrs  0.9537
plot(rrfImp, top = 20, main='Variable Importance')
``````

The same varImp() approach works for the many other algorithms supported by caret's train(), including:

ada, AdaBag, AdaBoost.M1, adaboost, bagEarth, bagEarthGCV, bagFDA, bagFDAGCV, bartMachine, blasso, BstLm, bstSm, C5.0, C5.0Cost, C5.0Rules, C5.0Tree, cforest, chaid, ctree, ctree2, cubist, deepboost, earth, enet, evtree, extraTrees, fda, gamboost, gbm_h2o, gbm, gcvEarth, glmnet_h2o, glmnet, glmStepAIC, J48, JRip, lars, lars2, lasso, LMT, LogitBoost, M5, M5Rules, msaenet, nodeHarvest, OneR, ordinalNet, ORFlog, ORFpls, ORFridge, ORFsvm, pam, parRF, PART, penalized, PenalizedLDA, qrf, ranger, Rborist, relaxo, rf, rFerns, rfRules, rotationForest, rotationForestCp, rpart, rpart1SE, rpart2, rpartCost, rpartScore, rqlasso, rqnc, RRF, RRFglobal, sdwd, smda, sparseLDA, spikeslab, wsrf, xgbLinear, xgbTree.

3. Lasso Regression

LASSO regression can also be viewed as an effective variable-selection technique: variables whose coefficients are shrunk to exactly zero are effectively dropped from the model.

``````library(glmnet)

x <- as.matrix(trainData[,-63]) # all X vars
y <- as.double(as.matrix(ifelse(trainData[, 63]=='normal', 0, 1))) # Only Class

# Fit the LASSO model (Lasso: Alpha = 1)
set.seed(100)
cv.lasso <- cv.glmnet(x, y, family='binomial', alpha=1, parallel=TRUE, standardize=TRUE, type.measure='auc')

# Results
plot(cv.lasso)
``````

Figure: LASSO variable importance (cross-validation curve)

The x-axis is the natural log of lambda, so a tick at 2, for example, corresponds to an actual lambda of exp(2) ≈ 7.39.

``````# plot(cv.lasso$glmnet.fit, xvar="lambda", label=TRUE)
cat('Min Lambda: ', cv.lasso$lambda.min, '\n 1Sd Lambda: ', cv.lasso$lambda.1se)
df_coef <- round(as.matrix(coef(cv.lasso, s=cv.lasso$lambda.min)), 2)

# See all contributing variables
df_coef[df_coef[, 1] != 0, ]
Min Lambda:  0.01166507
 1Sd Lambda:  0.2513163

(Intercept)   3.65
at           -0.17
as           -2.05
eat          -0.53
mhci          6.22
phcs         -0.83
phci          6.03
hvc          -4.15
vass        -23.72
vbrn         -0.26
vars        -25.86
mdt          -2.34
mds           0.50
mdn           0.83
mdi           0.30
tmg           0.01
tms           3.02
tmi           2.65
mv            4.94
``````

R packages for exporting ggplot figures to PowerPoint

1. graph2ppt() from the export package

https://github.com/tomwenseleers/export

Although export has been removed from CRAN, it can still be installed from GitHub: devtools::install_github("tomwenseleers/export")

2. topptx() from the eoffice package

https://github.com/guokai8/eoffice

3. officer

https://github.com/davidgohel/officer
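As a hedged sketch of option 1 (graph2ppt exports the most recently displayed plot into an editable PowerPoint slide; the file name and dimensions below are arbitrary):

```r
# Assumes export was installed from GitHub:
# devtools::install_github("tomwenseleers/export")
library(ggplot2)
library(export)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
print(p)  # display the plot first; graph2ppt picks up the current device
graph2ppt(file = "scatter.pptx", width = 6, height = 5)
```

The resulting slide contains vector graphics, so the figure stays editable inside PowerPoint.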

PacBio's tools are updated remarkably quickly; see https://github.com/PacificBiosciences/pbbioconda

The contiguous sequence generated by the polymerase during sequencing is referred to as a “polymerase read” or a Continuous Long Read (CLR).

This CLR read may include sequence from adapters and multiple copies of inserts, because it traverses the circular template many times. The CLRs are processed to remove adapter sequences and to retain only the insert sequence, called “subreads”.

All other sequences sequenced from the CLR are called “scraps”.

Multiple copies of subreads generated from a single SMRTbell can then be collapsed into a single, high-quality sequence, called the "read of insert" (ROI) or Circular Consensus Sequence (CCS).

The scraps file includes all the sequence not in the HQ (high quality) region, and the adapter sequences. The old bax format saved everything with indexes into the usable data. The subreads.bam contains all the usable data.

.subreads.bam: contains the usable reads, with adapters removed. The Sequel System outputs one subreads.bam file per collection/cell, containing unaligned base calls from high-quality regions.
.scraps.bam: contains the sequence data outside the high-quality (HQ) region: rejected subreads, excised adapter and possible barcode sequences, as well as spike-in control sequences. (The basecaller marks regions of single-molecule sequencing activity as high-quality.)

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03751-8

z score

z score: this metric reflects a gene's tolerance to missense variation. It is based on comparing the expected number of missense variants in the gene with the number actually observed: the fewer missense variants observed relative to expectation, the higher the z score, and the less tolerant (more constrained) the gene.
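A simplified sketch of the idea (real pipelines such as gnomAD's model the per-gene expectation carefully; the scaling below is only illustrative):

```r
# Illustrative missense-constraint z score (simplified, not the exact
# production formula): positive z means fewer missense variants were
# observed than expected, i.e. the gene is missense-intolerant
missense_z <- function(observed, expected) {
  (expected - observed) / sqrt(expected)
}

missense_z(observed = 40, expected = 100)   # 6: strongly intolerant
missense_z(observed = 100, expected = 100)  # 0: as expected
```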

REVEL score

REVEL score: ClinGen SVI recommends REVEL for predicting the pathogenicity of missense variants. Unlike other commonly used missense pathogenicity predictors, REVEL integrates the predictions of 13 tools, including SIFT, PolyPhen, and GERP++, and performs particularly well on rare variants. When the REVEL score is > 0.75, ACMG criterion PP3 can be applied; when it is < 0.15, BP4 can be applied.
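Those two cutoffs can be expressed as a tiny helper (the function name is ours; the thresholds are the ones described above):

```r
# Map a REVEL score to the ACMG computational evidence codes described above
revel_evidence <- function(score) {
  if (score > 0.75) {
    "PP3"   # computational evidence of pathogenicity
  } else if (score < 0.15) {
    "BP4"   # computational evidence of benign impact
  } else {
    "none"  # no computational evidence either way
  }
}

revel_evidence(0.90)  # "PP3"
revel_evidence(0.05)  # "BP4"
```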

GERP++

GERP++ "rejected substitutions" (RS) score: GERP++ predicts site-level conservation from the rate of evolution. The RS score is the expected number of base substitutions at the site minus the observed number, so the larger the score, the more conserved the site; a site with a GERP++ RS score > 6.8 is considered conserved. When analyzing a synonymous variant that does not affect splicing, ACMG criterion BP7 can be applied if the RS score is < 6.8.
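A toy numeric illustration of the RS definition and the 6.8 cutoff (all values invented):

```r
# GERP++ RS at a site = expected substitutions - observed substitutions
gerp_rs <- function(expected, observed) expected - observed

rs <- gerp_rs(expected = 7.0, observed = 0.1)  # 6.9 "rejected" substitutions
rs > 6.8  # TRUE: the site would be called conserved at this threshold
```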