FPKM,RPKM,TPM 转载

2016-03-13

FPKM：Fragmentsper Kilobase Million，FPKM意义与RPKM极为相近。二者区别仅在于，Fragment 与Read。RPKM的诞生是针对早期的SE测序，FPKM则是在PE测序上对RPKM的校正。只要明确Reads 和Fragments的区别，RPKM和FPKM的概念便易于区分。Reads即是指下机后fastq数据中的每一条Reads，Fragments则是指每一段用于测序的核酸片段，在SE中，一个Fragments只测一条Reads，所以，Reads数与Fragments数目相等；在PE中，一个Fragments测两端，会得到2条Reads，但由于后期质量或比对的过滤，有可能一个Fragments的2条Reads最后只有一条进入最后的表达量分析。总之，对某一对Reads而言，这2条Reads只能算一个Fragments，所以，Fragment的最终数目是Reads的1到2倍之间。

在衡量基因表现量时，若是单纯以map到的read数来计算基因的表现量，在统计上是一件相当不合理事，因为在随机抽样的情况下，序列较长的基因被抽到的机率本来就会比序列短的基因较高，如此一来，序列长的基因永远会被认为表现量较高，而错估基因真正的表现量，所以Ali Mortazavi等人在2008年提出以RPKM在估计基因的表现量。

“Reads Per Kilobase Per Million Reads"，即"每一百万条Reads中，对基因的每1000个Base而言，比对到该1000个base的Reads数”

It used to be when you did RNA-seq, you reported your results in RPKM (Reads Per Kilobase Million) or FPKM (Fragments Per Kilobase Million). However, TPM (Transcripts Per Million) is now becoming quite popular. Since there seems to be a lot of confusion about these terms, I thought I’d use a StatQuest to clear everything up.

These three metrics attempt to normalize for sequencing depth and gene length. Here’s how you do it for RPKM: Count up the total reads in a sample and divide that number by 1,000,000 - this is our “per million” scaling factor. Divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM) Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.

启动子分析预测数据库

2016-03-12

Default Category

Promoter 2.0 Prediction Server http://www.cbs.dtu.dk/services/Promoter/ 很早之前的预测启动子在线软件，要求输入的gene序列为FASTA格式，可以在线做

Berkeley Drosophila Genome Group http://fruitfly.org:9005/seq_tools/promoter.html 果蝇基因组相关信息，利用神经网络预测启动子序列

McPromoter http://tools.genome.duke.edu/generegulation/McPromoter/McPromoter.html 马尔科夫预测gene的转录其实位点，新版本只提供果蝇

亚效等位基因和超效等位基因

2016-03-12

Default Category

Hypomorph 亚效等位基因，亚等位基因亚效等位基因有基因功能，但功能比正常的要少。这种等位基因通常由于基因损害，更常见的是损害调控这个基因的功能区。通常

Sort VCF by Chr and Pos根据染色体位置对VCF进行排序

2016-03-01

Default Category

对VCF文件中的突变按照染色体和位置进行排序，下面是本人的总结，其中利用bash命令的方法不依赖其他的工具或包。htslib前文中也提到过。

1, Use bash

bash raw.vcf

1
2
3
4
5
6
7
8
9


chr_order="chrMnchr1nchr2nchr3nchr4nchr5nchr6nchr7nchr8nchr9nchr10nchr11nchr12nchr13nchr14nchr15nchr16nchr17nchr18nchr19nchr20nchr21nchr22nchrXnchrY"

cat "$1" ' grep "^#" > .header.vcf
cat "$1" ' grep -v "^#" ' sort -k1,1 -k2,2n > .pre.sorted.vcf
echo -e $chr_order ' while read line
do
    cat .pre.sorted.vcf ' grep "^$line"$'t' >> .header.vcf
done
mv .header.vcf  sorted.vcf && rm .header.vcf .pre.sorted.vcf

2，Use awk and sed

(awk ‘/^#/{print}!/^#/{exit}’ raw.vcf;sed ‘/^#/d’ raw.vcf’awk -F"\t" ‘($1~/^[0-9]+$/){sub("^chr","",$0);print $0}‘‘sort -k1,1n -k2,2n’awk ‘{print “chr”$0}’ ;sed ‘/^#/d’ raw.vcf’ awk -F"\t" ‘($1!~/^[0-9]+$/){sub("^chr","",$0);print $0}‘‘sort -k1,1d -k2,2n’awk ‘{print “chr”$0}') > sort.vcf

3, Use Picard

Sorts one or more VCF files. This tool sorts the records in VCF files according to the order of the contigs in the header/sequence dictionary and then by coordinate. It can accept an external sequence dictionary.

java -jar picard.jar SortVcf I=unsort.vcf O=sorted.vcf

4，Use vcf-sort (in vcftools)

cat file.vcf ' vcf-sort > sorted.vcf

合并VCF文件 Merge VCF file

2016-02-29

Default Category

VCF为variant call file格式，现为标准的SNV突变存储格式。通常情况下，一个VCF文件对应一个样本的突变。但VCF格式同样支持同时在一个文件中表示多个样本的突变，在每一行的最后几列，每一列代表一个样本在特定位点的突变情况，若样本在该位点没有突变，在代表样本的那一列在该位点的记录为.点号。此外，在生成VCF的时候，可以生成包含多个样本突变的VCF。但是要在序列比对生成SAM文件时要加入SM tag用来指定哪些reads属于哪个样本。

多个样本的突变情况用一个VCF文件存储的好处在于，对于发生突变的特定位点，可以迅速了解不同样本在该位点的突变情况。比如下列，最后三列表示三个样本在染色体chrT的525位点的突变情况，

1
2


#对应GT:PL:GQ:AD:DP。其中第一个样本在该位点没有发生突变，第二个样本的基因型为C/C，第三个样本的基因型为G/T。
chrT    515    .    G    C,T    1230.23    PASS    AC=4;AF=1.00;AN=4;DP=76;FS=0.000;MLEAC=2;MLEAF=1.00;MQ0=0;MQ=60.24;QD=33.32;RPA=5,4;RU=CA;SF=1,2;SOR=0.793;STR;set=variant23    GT:PL:GQ:AD:DP    .    1/1:1570,120,0:99:0,40:46    0/2:965,69,0:69:0,23:30

一，vcftools

VCFtools is a program package designed for working with VCF files, such as those generated by the 1000 Genomes Project. The aim of VCFtools is to provide easily accessible methods for working with complex genetic variation data in the form of VCF files.

VCFtools的功能和SAMtools类似，用于处理VCF格式的文件，可用于合并VCF。

VCFtools的输入文件格式为gz格式，需要用bgzip压缩，并用tabix生成index文件。 tabix和bgzip在samtools安装包的htslib-*文件夹中，需要单独安装。首先下载samtools（也可以专门下载htslib包，或者你已经有samtools，看是否有htslib文件夹），进入htslib文件夹中

1
2
3
4
5
6


wget https://github.com/samtools/htslib/releases/download/1.3/htslib-1.3.tar.bz2
tar -jxvf htslib-1.3.tar.bz2
cd htslib-1.3
./configure 
make
make prefix=/path/to/install install

download vcftools，install vcftools

1
2
3
4
5
6


git clone https://github.com/vcftools/vcftools.git
cd vcftools/ 
./autogen.sh 
./configure 
make 
make install

compress, index and merge VCF

1
2
3
4
5
6
7


bgzip -c "x.vcf" > "x.vcf.gz"
bgzip -c "y.vcf > "y.vcf.gz"
bgzip -c "z.vcf" > "z.vcf.gz"
tabix -f "x.vcf.gz"
tabix -f "y.vcf.gz"
tabix -f "z.vcf.gz"
vcf_merge   x.vcf.gz   y.vcf.gz   z.vcf.gz  > merged.vcf