前面讲了用工具定量小RNA，但要么软件安装困难，要么维护不好给种错误，所以自己也搭建（抄）了一套流程。主要思路是bowtie比对，HTSeq定量。其实也是利用了HTSeq定量需要gff文件的特点，比对到全基因组后只需要准备好gtf文件即可。

bowtie和bowtie的index文件我已经有了，是在miRDeep2流程中定量miRNA准备的，本文主要介绍定量其他种类的非编码小RNA。

比对

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


zcat input.fastq.gz | bowtie --seedlen 10 -p 48 -v2 -m20 --best -S --strata --chunkmbs 8000 $BOWTIE_IND - output.sam

-M 1           like -m, but reports 1 random hit (MAPQ=0); requires --best
- 为zcat的输入，bowtie不识别gz，所以用zcat管道流入bowtie
-p 48 线程
-v report end-to-end hits w/ <=v mismatches; ignore qualitie
-m20 丢弃超过20次multi-map的reads
--best --strata multi-map的reads只报告最好的比对
--chunkmbs 8000 best-first搜索的最大内存（M）
-S 输出SAM格式
BOWTIE_IND bowtie的索引

SAM文件处理

1

samtools sort --threads 10 out.sam -o out_sorted.bam --output-fmt BAM && samtools index out_sorted.bam

gtf文件准备

snRNA,snoRNA等（来自gencode的注释）

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


axel -n 10 https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz

zcat gencode.v44.annotation.gtf.gz | grep "snoRNA" > snoRNA.gtf
zcat gencode.v44.annotation.gtf.gz | grep "Mt_rRNA" > Mt_rRNA.gtf
zcat gencode.v44.annotation.gtf.gz | grep "Mt_tRNA" > Mt_tRNA.gtf
zcat gencode.v44.annotation.gtf.gz | grep "rRNA" > rRNA.gtf
zcat gencode.v44.annotation.gtf.gz | grep "snRNA" > snRNA.gtf
zcat gencode.v44.annotation.gtf.gz | grep "sRNA" > sRNA.gtf
zcat gencode.v44.annotation.gtf.gz | grep "scaRNA" > scaRNA.gtf
zcat gencode.v44.annotation.gtf.gz | grep "miRNA" > miRNA.gtf

tRNA

1
2
3


# 也来自gencode，tRNA genes predicted by ENSEMBL on the reference chromosomes using tRNAscan-SE
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.tRNAs.gtf.gz
zcat gencode.v44.tRNAs.gtf.gz > tRNA.gtf

piRNA

1
2
3
4
5


# piRBase官网没打开，从这个网站下载的http://bigdata.ibp.ac.cn/piRBase/download.php
# 先过滤补丁染色体，然后把MT改成M，然后加chr
wget http://bigdata.ibp.ac.cn/piRBase/download/v3.0/bed/hsa.align.bed.gz
zcat hsa.align.bed.gz | grep -v "CHR_" | grep -v "KI"  | grep -v "GL" | sed "s#^MT#M#" |awk '{print "chr"$1"\tpiRBase\tpiRNA\t"$2"\t"$3"\t.\t"$6"\t.\tname \""$4"\"; gene_id \""$4"\"; gene_type \"piRNA\"; gene_name \""$4"\";"}' > piRNA.gtf

合并gtf

1
2
3
4
5
6


cat piRNA.gtf tRNA.gtf miRNA.gtf snRNA.gtf snoRNA.gtf scaRNA.gtf Mt_tRNA.gtf rRNA.gtf Mt_rRNA.gtf >  ncRNA_gencode.gtf

# mamba install -c bioconda gff3sort
# 还有feature为tRNA和piRNA（第三列）改为了gene，这样htseq可以指定统计feature为gene
gff3sort.pl --precise --chr_order natural ncRNA_gencode.gtf | grep -v "^#" | awk -F "\t" -v OFS='\t' '{if($3 == "piRNA" || $3 == "tRNA") {$3="gene"}; print $0}' > ncRNA_gencode_sort.gtf

HTSeq-Count定量

1
2
3
4
5
6
7
8
9


# pip install HTSeq
# https://htseq.readthedocs.io/en/release_0.11.1/count.html
# HTSeq-Count需要的内存大，如果是集群上跑任务，建议设置大内存，避免被killed掉（因为我碰到这种情况了-_-||）

htseq-count -m union --stranded yes -t gene out_sorted.bam $GENEREF > output_counts_ncRNA.txt

GENEREF ncRNA_gencode_sort.gtf路径
-m union Counting mode setting
--stranded 文库是否是链特异性的

count还有用Subread软件的featureCounts，推荐用featureCounts，速度快内存小。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


# install -c bioconda subread
# https://subread.sourceforge.net/featureCounts.html
# -s 1 strand specific
# -O Assign reads to all their overlapping meta-features (or features if -f is specified).
# -M Multi-mapping reads will also be counted. For a multi-mapping read, all its reported alignments will be
# counted. The 'NH' tag in BAM/SAM input is used to detect multi-mapping reads.

featureCounts -s 1 -T 16 -t gene -g gene_id --extraAttributes gene_type -a $GENEREF -o featureCount.txt out_sorted.bam

-O 如果一个read比对到overlapping的feature上，两个feature都被认为有比对

下一步分析

piRNA实在太多了（piRBase里面的piRNA有10615833条），差异分析之前，可以先把没有表达的piRNA过滤掉，可以过掉很多piRNA。下一步可以根据sncRNA的类型，或者综合分析。normalization，differential expression analysis，sncRNA type等。

还有其他非编码的小RNA，如tsRNA和vtRNA。

其他工具

exceRpt：The extra-cellular RNA processing toolkit （好用，但是需要docker，自己安装麻烦https://www.zxzyl.com/archives/1455/有介绍） http://github.gersteinlab.org/exceRpt/

COMPSRA（后期维护不好，文档显得有点乱，https://www.zxzyl.com/archives/1478/）: a COMprehensive Platform for Small RNA-seq data AnalySis：https://github.com/cougarlj/COMPSRA

seqpac：A Framework for small RNA analysis in R using Sequence-Based Counts：https://rpubs.com/signeskog/seqpac

SortMeRNA：https://github.com/sortmerna/sortmerna，可以指定多种reference，比如鉴定病毒序列等

bcbio-nextgen（软件难安装）：https://bcbio-nextgen.readthedocs.io/en/latest/contents/small_rnaseq.html

这篇文章有讲自己如何比对non-coding RNA，我参考了这个：https://www.cell.com/molecular-therapy-family/nucleic-acids/pdfExtended/S2162-2531(18)30210-5

https://github.com/bcbio/bcbio-nextgen/issues/2439 这个里面提到了piRNA等注释文件

SPORTS1.0: A Tool for Annotating and Profiling Non-coding RNAs Optimized for rRNA- and tRNA-derived Small RNAs https://github.com/junchaoshi/sports1.1

tRNA fragments profiling

https://github.com/TJU-CMC-Org/MINTmap/

tRNA annotation

https://bioconductor.org/packages/devel/bioc/vignettes/tRNA/inst/doc/tRNA.html

tRNA db

http://trna.bioinf.uni-leipzig.de/DataOutput/

####################################################################

#Author: Jason

#####################################################################

比对、定量多种类型小RNA

文章目录

比对