SIFT and PolyPhen-2: an introduction to databases for predicting the deleteriousness of mutations

About the SIFT score

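SIFT (Sorting Intolerant From Tolerant) sorts intolerant from tolerant amino acid substitutions: it searches for similar sequences, aligns them, and calculates the probability of each amino acid substitution. Substitutions with a probability below 0.05 are considered deleterious.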

SIFT takes a query sequence and uses multiple alignment information to predict tolerated and deleterious substitutions for every position of the query sequence. SIFT is a multistep procedure that (1) searches for similar sequences, (2) chooses closely related sequences that may share similar function to the query sequence, (3) obtains the alignment of these chosen sequences, and (4) calculates normalized probabilities for all possible substitutions from the alignment. Positions with normalized probabilities less than 0.05 are predicted to be deleterious; those greater than or equal to 0.05 are predicted to be tolerated.
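The threshold rule in step (4) can be sketched in a few lines. This is only an illustration of the cutoff: the function name `classify_sift` and the example scores are hypothetical, and the alignment and probability steps are not reproduced here.

```python
SIFT_CUTOFF = 0.05  # cutoff stated in the SIFT description above


def classify_sift(score: float) -> str:
    """Label a normalized SIFT probability as deleterious or tolerated."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"SIFT score must be in [0, 1], got {score}")
    # Strictly below the cutoff is deleterious; equal to or above is tolerated.
    return "deleterious" if score < SIFT_CUTOFF else "tolerated"


# Made-up scores, for illustration only.
for s in (0.0, 0.049, 0.05, 0.8):
    print(s, classify_sift(s))
```

Note that a score of exactly 0.05 falls on the tolerated side, matching the "greater than or equal to" wording.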

Built-in SAMtools statistics commands: idxstats, stats, flagstat, bedcov, and depth

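SAMtools is not only for calling SNPs. As the name suggests, it is a toolkit for working with SAM-format files: converting SAM to BAM, indexing, removing duplicates (rmdup), and so on. Combined with Linux commands such as grep and awk, and with the FLAG and tag fields defined by the SAM format, samtools becomes very powerful. For example, you can filter out duplicate reads by FLAG, or filter out multiple-hit reads by the XA tag. This post covers only the samtools statistics commands, which let you quickly compute a range of statistics on a BAM file.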
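As a minimal sketch of the FLAG and XA filtering idea, the following Python works on SAM text lines directly. The function names and the three example records are hypothetical; the duplicate bit (0x400) comes from the SAM FLAG definition, and the XA tag is assumed to be present only on reads with alternative hits, as BWA writes it.

```python
DUPLICATE_FLAG = 0x400  # "PCR or optical duplicate" bit of the SAM FLAG field


def keep_read(sam_line: str) -> bool:
    """True if the alignment is not a duplicate and carries no XA tag."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # column 2 is FLAG
    if flag & DUPLICATE_FLAG:
        return False
    # Optional tags start at column 12 (index 11).
    return not any(tag.startswith("XA:Z:") for tag in fields[11:])


def filter_sam(lines):
    """Yield header lines unchanged plus the alignments that pass keep_read."""
    for line in lines:
        if line.startswith("@") or keep_read(line):
            yield line


# Hypothetical records: r2 has the duplicate bit set, r3 has an XA tag.
example = [
    "@SQ\tSN:chr1\tLN:249250621",
    "r1\t99\tchr1\t100\t60\t10M\t=\t200\t110\tACGTACGTAC\tIIIIIIIIII",
    "r2\t1123\tchr1\t100\t60\t10M\t=\t200\t110\tACGTACGTAC\tIIIIIIIIII",
    "r3\t0\tchr1\t300\t0\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tXA:Z:chr2,+500,10M,0;",
]
kept = list(filter_sam(example))
print(len(kept), "lines kept")
```

On the command line, the FLAG part of this filter corresponds to excluding reads by flag with `samtools view -F 0x400`; the XA part is typically done with grep or awk on the text output.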

Statistics tools built into samtools

**samtools idxstats**

Retrieves and prints the statistics stored in the index file that corresponds to the input file, so the input BAM file must be indexed first.

The output columns are the reference sequence name, the sequence length, the number of mapped reads, and the number of unmapped reads:

    chr1    249250621    4998344    1005
    chr2    243199373    3020248    595
    chr3    198022430    2418804    449
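Output in this shape is easy to post-process. The sketch below parses the three rows shown above and derives mapped reads per megabase; the function name `parse_idxstats` and the per-megabase summary are illustrative choices, not part of samtools.

```python
IDXSTATS_TEXT = """\
chr1\t249250621\t4998344\t1005
chr2\t243199373\t3020248\t595
chr3\t198022430\t2418804\t449
"""


def parse_idxstats(text):
    """Parse idxstats-style rows into (name, length, mapped, unmapped) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, length, mapped, unmapped = line.split()
        rows.append((name, int(length), int(mapped), int(unmapped)))
    return rows


rows = parse_idxstats(IDXSTATS_TEXT)
total_mapped = sum(mapped for _, _, mapped, _ in rows)
for name, length, mapped, _ in rows:
    # Reads per megabase normalizes the count by chromosome length.
    per_mb = mapped / (length / 1e6)
    print(f"{name}\t{per_mb:.1f} reads/Mb\t{mapped / total_mapped:.1%} of mapped reads")
```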

**samtools bedcov**

Reports the total number of read bases covering each region.

The output columns are the chromosome, start - 1 (the 0-based start), end, and the total number of bases:

    chr1    100000     1000000     1709228
    chr2    2000000    65885852    64362582
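Dividing the total base count by the region length gives the mean depth of each region. The sketch below does this for the two rows above, assuming the start column is 0-based and the end column is exclusive (so the length is end minus start); the function name is hypothetical.

```python
BEDCOV_TEXT = """\
chr1 100000 1000000 1709228
chr2 2000000 65885852 64362582
"""


def mean_depth_per_region(text):
    """Map (chrom, start, end) to total bases divided by region length."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        chrom, start, end, total_bases = line.split()[:4]
        start, end = int(start), int(end)
        length = end - start  # assumes a 0-based, half-open interval
        result[(chrom, start, end)] = int(total_bases) / length
    return result


depths = mean_depth_per_region(BEDCOV_TEXT)
for (chrom, start, end), depth in depths.items():
    print(f"{chrom}:{start}-{end}\tmean depth {depth:.2f}")
```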

**samtools depth**

Reports the depth at each position.

    #chr    pos    depth
    chr1    1      5
    chr1    2      5
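Per-position output like this can be reduced to summary statistics. The following sketch uses the two rows above; `summarize_depth` is a hypothetical helper. Keep in mind that if zero-coverage positions are missing from the output, the mean is computed over covered positions only.

```python
DEPTH_TEXT = """\
#chr pos depth
chr1 1 5
chr1 2 5
"""


def summarize_depth(text):
    """Return position count, mean, min, and max depth from depth-style rows."""
    depths = []
    for line in text.splitlines():
        # Skip blank lines and the commented header line.
        if not line.strip() or line.startswith("#"):
            continue
        _chrom, _pos, depth = line.split()
        depths.append(int(depth))
    if not depths:
        raise ValueError("no depth rows found")
    return {
        "positions": len(depths),
        "mean": sum(depths) / len(depths),
        "min": min(depths),
        "max": max(depths),
    }


summary = summarize_depth(DEPTH_TEXT)
print(summary)
```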

**samtools flagstat**

Summarizes the reads by FLAG, for example how many reads are mapped:

    43444444 + 0 in total (QC-passed reads + QC-failed reads)
    5863846 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    43431948 + 0 mapped (99.97%:-nan%)
    37580598 + 0 paired in sequencing
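Each flagstat line has the form "QC-passed + QC-failed description", so it can be parsed with one regular expression. The sketch below parses the lines shown above and recomputes the mapped percentage; `parse_flagstat` and the choice of dictionary keys are assumptions made for illustration.

```python
import re

FLAGSTAT_TEXT = """\
43444444 + 0 in total (QC-passed reads + QC-failed reads)
5863846 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
43431948 + 0 mapped (99.97%:-nan%)
37580598 + 0 paired in sequencing
"""

LINE_RE = re.compile(r"^(\d+) \+ (\d+) (.+)$")


def parse_flagstat(text):
    """Map each category name to a (qc_passed, qc_failed) pair of counts."""
    stats = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        passed, failed, description = match.groups()
        # Drop any parenthesized suffix, e.g. "mapped (99.97%:-nan%)" -> "mapped".
        key = description.split(" (")[0]
        stats[key] = (int(passed), int(failed))
    return stats


stats = parse_flagstat(FLAGSTAT_TEXT)
mapped_pct = 100 * stats["mapped"][0] / stats["in total"][0]
print(f"{mapped_pct:.2f}% of QC-passed reads are mapped")
```

In this example the total minus the secondary alignments (43444444 - 5863846) equals the 37580598 reads paired in sequencing.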

Soft clipping and hard clipping in SAM files

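A clipped alignment means that the alignment did not use the whole read sequence: the sequence at the two ends of the read was cut off (clipped or trimmed). The example below is a clipped alignment.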

Alignment:

    Read:      ACGGTTGCGTTAA-TCCGCCACG
                   ||||||||| ||||||
    Reference: TAACTTGCGTTAAATCCGCCTGG

The counterpart of a clipped alignment is a spliced alignment, in which the middle of the read is not aligned while the two ends are. It looks like this:

Alignment:

    Read:      ACGGTTGCGTTAAGCTCATCCGCCACG
               |||||||||||||     |||||||||
    Reference: ACGGTTGCGTTAA.....TCCGCCACG

In a CIGAR string, a clipped alignment is represented by one of two operations: S (soft clip) or H (hard clip). The BWA documentation notes: If the read has a chimeric alignment, the paired or the top hit uses soft clipping and is marked with neither 0x800 nor 0x100 bits. All the other hits part of the chimeric alignment will use hard clipping and be marked with 0x800 if option "-M" is not in use, or marked with 0x100 otherwise.

In other words, when a chimeric alignment is found, the best alignment (the top hit) is soft-clipped and the remaining hits are hard-clipped.

With a hard clip, the clipped portion does not appear in the read's sequence in the SAM file (clipped sequences not present in SEQ). With a soft clip, it does (clipped sequences present in SEQ).
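The difference shows up when you compare the CIGAR string with the length of SEQ. The sketch below counts soft-clipped and hard-clipped bases and computes the SEQ length a CIGAR implies. The CIGAR `4S9M1D6M3S` is one possible reading of the first diagram above (4 leading and 3 trailing read bases clipped, the gap treated as a 1-base deletion); it is an illustration, not aligner output, and `clip_summary` is a hypothetical helper.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations whose bases are stored in the SEQ column; H, D, N, and P are not.
CONSUMES_SEQ = set("MIS=X")


def clip_summary(cigar: str) -> dict:
    """Count clipped bases and the SEQ length implied by a CIGAR string."""
    ops = [(int(count), op) for count, op in CIGAR_OP.findall(cigar)]
    return {
        "soft": sum(n for n, op in ops if op == "S"),
        "hard": sum(n for n, op in ops if op == "H"),
        "seq_len": sum(n for n, op in ops if op in CONSUMES_SEQ),
    }


# Same alignment written with soft clips and with hard clips.
print(clip_summary("4S9M1D6M3S"))  # SEQ keeps all 22 read bases
print(clip_summary("4H9M1D6M3H"))  # SEQ keeps only the 15 aligned bases
```

With soft clipping the 7 clipped bases still count toward SEQ, while with hard clipping they are absent, which is why hard-clipped records cannot be used to recover the original read.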

GC content of human chromosomes

The GC content is the molar ratio of guanine+cytosine bases in DNA. The human genome is a mosaic of GC-rich and GC-poor regions, around 300 kb in length, called isochores. GC content is an important factor in many experiments and bioinformatic analyses. This is especially true for next-generation sequencing, where the DNA being sequenced has gone through multiple rounds of PCR amplification.
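For a sequence held as a string, GC content is a short calculation. The sketch below is a minimal version that assumes ambiguous bases such as N should be left out of the denominator. The function names and the tiny example sequences are hypothetical, and real chromosomes would normally be read from a FASTA file.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return gc / (gc + at)


def windowed_gc(seq: str, window: int):
    """GC content of consecutive, non-overlapping windows of the given size."""
    return [gc_content(seq[i:i + window]) for i in range(0, len(seq), window)]


print(gc_content("ATGC"))
print(gc_content("ATGCNN"))        # N bases are ignored
print(windowed_gc("GGGGAAAA", 4))  # one GC-rich window, one GC-poor window
```

To look for isochore-scale variation, the window would be on the order of the roughly 300 kb mentioned above rather than the 4 bases used in this toy call.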