标签归档:Mutation

样本突变频谱分析和基于突变频谱的热图聚类

对突变频谱(Mutation spectrum)进行分析,可以得知样本各种类型突变(如C:G>T:A)的数量及样本是否具有某种类型突变的偏好性。

单个碱基的替换一共有六种变异类型:C>A/G>T,C>G/G>C,C>T/G>A,T>A/A>T,T>C/A>G,T>G/A>C,有可以根据发生替换的碱基类别分为两大类:颠换transversion,嘌呤与嘧啶之间的替换,转换transition,嘌呤与嘌呤或嘧啶与嘧啶之间的替换。根据各类型点突变的比例,对样本和点突变类型进行聚类分析,可以研究样本点突变类型的偏好和各样本在点突变水平上的相似程度。

1,突变频谱数据准备

从VCF或者Annovar注释之后文件中,提取突变类型,第一列为样本名,第二列为突变类型,命名为type.txt。有来自来个实验的10个样本,分别为S1–S7和CHG。

Name    Type
S5      T>C/A>G
S5      T>C/A>G
S5      C>T/G>A
S5      T>A/A>T
S5      T>C/A>G
S1      C>T/G>A
S1      T>C/A>G
S1      C>T/G>A
S1      T>C/A>G

2,生成突变频谱图

library(ggplot2)
data1<- read.table("/path/to/type.txt",sep='t',header=T,check=F)
f1<- ggplot(data = data1, aes(x = Name, fill = Type)) + geom_bar(position = "fill") + labs(title = "Mutation Spectrum",x = "",y = "Fraction of Mutations") + theme(panel.background = element_blank(), axis.text.x  = element_text(angle=90), text = element_text(size=16) ) 
svg(file="mutation_spectrum.svg")
f1
dev.off()

继续阅读

预测突变危害的SIFT和PolyPhen2数据库介绍–SIFT and polyphen2 introduction

SIFT分数介绍

SIFT sorts intolerant from tolerant amino acid substitutions :通过寻找近似的序列,进行比对,计算发生碱基替换的概率,小于0.05被认为是有害的。

SIFT takes a query sequence and uses multiple alignment information to predict tolerated and deleterious substitutions for every position of the query sequence. SIFT is a multistep procedure that (1) searches for similar sequences, (2) chooses closely related sequences that may share similar function to the query sequence , (3) obtains the alignment of these chosen sequences, and (4) calculates normalized probabilities for all possible substitutions from the alignment. Positions with normalized probabilities less than 0.05 are predicted to be deleterious, those greater than or equal to 0.05 are predicted to be tolerated.

PolyPhen-2介绍

考虑结构域,三维结构,通过机器学习然后对突变风险进行预测,计算FPR,。共两组数据集,第一组HumDiv是所有的已知和疾病有关的突变,及所在蛋白在人类和哺乳动物之间的差别,第二组HumVar是已知和疾病有关的突变作为有害和common SNP作为无害共同组成。
HumVar预测具有强烈危害的突变,HumDiv预测可能参与复杂疾病的稀有allele,在HumDiv中,具有中等危害的allele也被归为damaging。
probably damaging的区间值为0.909-1,possibly damaging为0.447-0.908,benign为0-0.446。

Two pairs of datasets were used to train and test PolyPhen-2 prediction models. The first pair, HumDiv, was compiled from all damaging alleles with known effects on the molecular function causing human Mendelian diseases, present in the UniProtKB database, together with differences between human proteins and their closely related mammalian homologs, assumed to be non-damaging. The second pair, HumVar, consisted of all human disease-causing mutations from UniProtKB, together with common human nsSNPs (MAF>1%) without annotated involvement in disease, which were treated as non-damaging.

The user can choose between HumDiv- and HumVar-trained PolyPhen-2 models. Diagnostics of Mendelian diseases requires distinguishing mutations with drastic effects from all the remaining human variation, including abundant mildly deleterious alleles. Thus, HumVar-trained model should be used for this task. In contrast, HumDiv-trained model should be used for evaluating rare alleles at loci potentially involved in complex phenotypes, dense mapping of regions identified by genome-wide association studies, and analysis of natural selection from sequence data, where even mildly deleterious alleles must be treated as damaging.

The authors recommend calling “probably damaging” if the score is between 0.909 and 1, and “possibly damaging” if the score is between 0.447 and 0.908, and “benign” is the score is between 0 and 0.446.(from annovar)

http://genetics.bwh.harvard.edu/pph2/dokuwiki/overview
http://sift.jcvi.org/www/SIFT_help.html