前面讲了用工具定量小RNA,但要么软件安装困难,要么维护不好给种错误,所以自己也搭建(抄)了一套流程。主要思路是bowtie比对,HTSeq定量。其实也是利用了HTSeq定量需要gff文件的特点,比对到全基因组后只需要准备好gtf文件即可。

bowtie和bowtie的index文件我已经有了,是在miRDeep2流程中定量miRNA准备的,本文主要介绍定量其他种类的非编码小RNA。

比对

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
zcat input.fastq.gz | bowtie --seedlen 10 -p 48 -v2 -m20 --best -S --strata --chunkmbs 8000 $BOWTIE_IND - output.sam

-M 1           like -m, but reports 1 random hit (MAPQ=0); requires --best
- 为zcat的输入,bowtie不识别gz,所以用zcat管道流入bowtie
-p 48 线程
-v report end-to-end hits w/ <=v mismatches; ignore qualitie
-m20 丢弃超过20次multi-map的reads
--best --strata multi-map的reads只报告最好的比对
--chunkmbs 8000 best-first搜索的最大内存(M)
-S 输出SAM格式
BOWTIE_IND bowtie的索引

SAM文件处理

1
samtools sort --threads 10 out.sam -o out_sorted.bam --output-fmt BAM && samtools index out_sorted.bam

gtf文件准备

snRNA,snoRNA等(来自gencode的注释)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
axel -n 10 https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz

zcat gencode.v44.annotation.gtf.gz | grep "snoRNA" > snoRNA.gtf
zcat gencode.v44.annotation.gtf.gz | grep "Mt_rRNA" > Mt_rRNA.gtf
zcat gencode.v44.annotation.gtf.gz | grep "Mt_tRNA" > Mt_tRNA.gtf
zcat gencode.v44.annotation.gtf.gz | grep "rRNA" > rRNA.gtf
zcat gencode.v44.annotation.gtf.gz | grep "snRNA" > snRNA.gtf
zcat gencode.v44.annotation.gtf.gz | grep "sRNA" > sRNA.gtf
zcat gencode.v44.annotation.gtf.gz | grep "scaRNA" > scaRNA.gtf
zcat gencode.v44.annotation.gtf.gz | grep "miRNA" > miRNA.gtf

tRNA

1
2
3
# 也来自gencode,tRNA genes predicted by ENSEMBL on the reference chromosomes using tRNAscan-SE
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.tRNAs.gtf.gz
zcat gencode.v44.tRNAs.gtf.gz > tRNA.gtf

piRNA

1
2
3
4
5
# piRBase官网没打开,从这个网站下载的http://bigdata.ibp.ac.cn/piRBase/download.php
# 先过滤补丁染色体,然后把MT改成M,然后加chr
wget http://bigdata.ibp.ac.cn/piRBase/download/v3.0/bed/hsa.align.bed.gz
zcat hsa.align.bed.gz | grep -v "CHR_" | grep -v "KI"  | grep -v "GL" | sed "s#^MT#M#" |awk '{print "chr"$1"\tpiRBase\tpiRNA\t"$2"\t"$3"\t.\t"$6"\t.\tname \""$4"\"; gene_id \""$4"\"; gene_type \"piRNA\"; gene_name \""$4"\";"}' > piRNA.gtf

合并gtf

1
2
3
4
5
6
cat piRNA.gtf tRNA.gtf miRNA.gtf snRNA.gtf snoRNA.gtf scaRNA.gtf Mt_tRNA.gtf rRNA.gtf Mt_rRNA.gtf >  ncRNA_gencode.gtf

# mamba install -c bioconda gff3sort
# 还有feature为tRNA和piRNA(第三列)改为了gene,这样htseq可以指定统计feature为gene
gff3sort.pl --precise --chr_order natural ncRNA_gencode.gtf | grep -v "^#" | awk -F "\t" -v OFS='\t' '{if($3 == "piRNA" || $3 == "tRNA") {$3="gene"}; print $0}' > ncRNA_gencode_sort.gtf

HTSeq-Count定量

1
2
3
4
5
6
7
8
9
# pip install HTSeq
# https://htseq.readthedocs.io/en/release_0.11.1/count.html
# HTSeq-Count需要的内存大,如果是集群上跑任务,建议设置大内存,避免被killed掉(因为我碰到这种情况了-_-||)

htseq-count -m union --stranded yes -t gene out_sorted.bam $GENEREF > output_counts_ncRNA.txt

GENEREF ncRNA_gencode_sort.gtf路径
-m union Counting mode setting
--stranded 文库是否是链特异性的

count还有用Subread软件的featureCounts,推荐用featureCounts,速度快内存小。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
# install -c bioconda subread
# https://subread.sourceforge.net/featureCounts.html
# -s 1 strand specific
# -O Assign reads to all their overlapping meta-features (or features if -f is specified).
# -M Multi-mapping reads will also be counted. For a multi-mapping read, all its reported alignments will be
# counted. The 'NH' tag in BAM/SAM input is used to detect multi-mapping reads.

featureCounts -s 1 -T 16 -t gene -g gene_id --extraAttributes gene_type -a $GENEREF -o featureCount.txt out_sorted.bam

-O 如果一个read比对到overlapping的feature上,两个feature都被认为有比对

下一步分析

piRNA实在太多了(piRBase里面的piRNA有10615833条),差异分析之前,可以先把没有表达的piRNA过滤掉,可以过掉很多piRNA。下一步可以根据sncRNA的类型,或者综合分析。normalization,differential expression analysis,sncRNA type等。

还有其他非编码的小RNA,如tsRNA和vtRNA。

其他工具

exceRpt:The extra-cellular RNA processing toolkit (好用,但是需要docker,自己安装麻烦https://www.zxzyl.com/archives/1455/有介绍) http://github.gersteinlab.org/exceRpt/

COMPSRA(后期维护不好,文档显得有点乱,https://www.zxzyl.com/archives/1478/): a COMprehensive Platform for Small RNA-seq data AnalySis:https://github.com/cougarlj/COMPSRA

seqpac:A Framework for small RNA analysis in R using Sequence-Based Counts:https://rpubs.com/signeskog/seqpac

SortMeRNA:https://github.com/sortmerna/sortmerna,可以指定多种reference,比如鉴定病毒序列等

bcbio-nextgen(软件难安装):https://bcbio-nextgen.readthedocs.io/en/latest/contents/small_rnaseq.html

这篇文章有讲自己如何比对non-coding RNA,我参考了这个:https://www.cell.com/molecular-therapy-family/nucleic-acids/pdfExtended/S2162-2531(18)30210-5

https://github.com/bcbio/bcbio-nextgen/issues/2439 这个里面提到了piRNA等注释文件

SPORTS1.0: A Tool for Annotating and Profiling Non-coding RNAs Optimized for rRNA- and tRNA-derived Small RNAs https://github.com/junchaoshi/sports1.1

tRNA fragments profiling

https://github.com/TJU-CMC-Org/MINTmap/

tRNA annotation

https://bioconductor.org/packages/devel/bioc/vignettes/tRNA/inst/doc/tRNA.html

tRNA db

http://trna.bioinf.uni-leipzig.de/DataOutput/

####################################################################

#版权所有 转载请告知 版权归作者所有 如有侵权 一经发现 必将追究其法律责任

#Author: Jason

#####################################################################