转录组分析新工具流程–HISAT2-stringtie-ballgown

一,HISAT2(Hierarchical Indexing for Spliced Alignment of Transcripts2)

HISAT2是一个对比对RNA-seq reads的快速灵敏的spliced alignment工具,HISAT2支持DNA和RNA比对。针对reads覆盖多个外显子,HISAT其包含两种索引:1,global FM索引,代表整个基因组,2,许许多多的local FM索引,每个索引代表~56,000bp,~55,000个local索引覆盖整个基因组。HISAT基于Bowtie2来处理大多数FM索引。心的索引scheme叫做 Hierarchical Graph FM index (HGFM)。HISAT也支持indel和paire end模式,并且支持多线程和SRA Toolkit。HISAT2官网提到HISAT2是HISAT和TopHat2继承,建议大家从HISAT和TopHat2迁移到HISAT2上来。

HISAT2的优势如图(图片来自HISAT2文章)

下载地址:
 http://ccb.jhu.edu/software/hisat2/downloads/
 wget http://ccb.jhu.edu/software/hisat2/downloads/hisat2-2.0.0-beta-Linux_x86_64.zip -P ./
 解 压 缩:
 unzip hisat2-2.0.0-beta-Linux_x86_64.zip

个人理解,在进行RNA序列比对时,提供剪切位点,能够使软件更快的比对read在基因组的位置。在运行hisat2时,可以通过–known-splicesite-infile指定splice的位置,但HISAT2提到,在build的时候加入splice site更好,所以在build之前,需要准备记录splice sites和exons的文件。HISAT2自带通过基因组gft注释文件提取splice site和exon的脚本。
# gtf 文件来自http://www.gencodegenes.org/releases/19.html#

 extract_exons.py hg19.annotation.gtf > exons.txt
 extract_splice_sites.py hg19.annotation.gtf > splicesites.txt

提取后的输出格式为
chr0 based left pos0 based left leftstrand

./hisat2-2.0.0-beta/hisat2-build -f ucsc.hg19.fasta --ss splicesites.txt --exon exons.txt -p 7 ./ucsc.hg19
## 添加--ss和--exon选项后,需要很大的内存,build 人基因组的话需要200G RAM,如果没有这么大内存,不要添加这两个选项,但要在后续运行hisat时添加 --known-splicesite-infile选项
hisat2-build -f ucsc.hg19.fasta -p 7 ./uscs.hg19 ##大概需要一小时二十分钟
build index的过程和bowtie2很像。
#-q指定fastq格式
hisat2 -q -x ./ucsc.hg19 -1 reads_1.fastq -2 reads_2.fastq -S alns.sam -t
## hisat2 -q -x ./ucsc.hg19 -1 reads_1.fastq -2 reads_2.fastq -S alns.sam --known-splicesite-infile splicesites.txt -t

index时输出的信息

index之后会生成

二,stringtie

比对完之后,就要确定基因或者转录本的表达量了。这里用到了第一步用到的gtf文件。

stringtie的输入BAM文件需要先进行sort
samtools view -Su alns.sam | samtools sort - alns.sorted
stringtie alns.sorted.bam -b stringtie_input_dir -e -G hg19.annotation.gtf -C cov_ref.gtf -p 7 -o stringtie.out.gtf
# -B This switch enables the output of Ballgown input table files (*.ctab) containing coverage data for the reference transcripts given with the -G option. or -b Just like -B this option enables the output of *.ctab files for Ballgown
# -e Limits the processing of read alignments to only estimate and output the assembled transcripts matching the reference transcripts given with the -G option
# -C StringTie outputs a file with the given name with all transcripts in the provided reference file that are fully covered by reads (requires -G).

在-b 指定的文件夹下生成特定的文件
e2t.ctab e_data.ctab i2t.ctab i_data.ctab t_data.ctab
e即外显子、i即内含子、t转录本;e2t即外显子和转录本间的关系,i2t即内含子和转录本间的关系,t_data即转录本的数据

运行过程中的输出信息

三,ballgown

https://www.bioconductor.org/packages/release/bioc/vignettes/ballgown/inst/doc/ballgown.html

ballgown的流程示例以ballgown自带的数据为例。

source("http://bioconductor.org/biocLite.R")
 biocLite("ballgown")
 library(ballgown)
 data_directory = system.file('extdata', package='ballgown')
 #data_directory = c('./stringtie.input/')
 bg = ballgown(dataDir=data_directory, samplePattern='sample', meas='all') # load data
 # 两组,每组各十个样本
 pData(bg) = data.frame(id=sampleNames(bg), group=rep(c(1,0), each=10)) # get pData
 stat_results = stattest(bg, feature='transcript', meas='FPKM', covariate='group',getFC = TRUE)
 write.table(stat_results,file="./result.out",sep="t",quote =FALSE,row.names=FALSE)

extdata的文件夹结构如下:
extdata/
—sample01/
——e2t.ctab
——e_data.ctab
——i2t.ctab
——i_data.ctab
——t_data.ctab
—sample02/
——e2t.ctab
——e_data.ctab
——i2t.ctab
——i_data.ctab
——t_data.ctab

—sample20/
——e2t.ctab
——e_data.ctab
——i2t.ctab
——i_data.ctab
——t_data.ctab

设置getFC为真才会得到fold change的值,输出如下:

转录组分析新工具流程–HISAT2-stringtie-ballgown》上有7条评论

  1. rose

    你好 如果有group信息不止两种condition要怎么做呢 为什么stattest的结果还是只有两两比较呢?谢谢!

    回复
    1. zzx

      stattest automatically handles two-group (e.g. case/control) comparisons, multi-group comparisons (e.g. comparison of several tissue types), and “timecourse” comparisons (with the scare quotes meaning that these comparisons are also applicable to continuous covariates that aren’t time). For two- and multi-group comparisons, a significant result indicates that the feature is differentially expressed in at least one of the groups. For timecourse comparisons, significant results mean the feature has an expression profile that varies significantly over time (i.e., values of the continuous covariate) as opposed to being flat over time.

      上面摘自 https://github.com/alyssafrazee/ballgown

      用自带数据测试(示例操作)

      library(ballgown)
       data_directory = system.file('extdata', package='ballgown')
       bg = ballgown(dataDir=data_directory, samplePattern='sample', meas='all') # load data
       # 四组,每组各五个样本
       pData(bg) = data.frame(id=sampleNames(bg), group=rep(c(1,0), each=5)) # get pData
       stat_results = stattest(bg, feature='transcript', meas='FPKM', covariate='group') #   getFC = TRUE  参数要去掉,不适合多组类型
       head(stat_results)
           feature id       pval      qval
      1 transcript 10 0.04156457 0.3165302
      2 transcript 25 0.57694188 0.9363483
      3 transcript 35 0.04055960 0.3165302
      4 transcript 41 0.13300061 0.6924604
      5 transcript 45 0.28969696 0.9138395
      6 transcript 67 0.41803883 0.9322752
      回复
  2. 小虫

    Error in ballgown(dataDir = data_directory, samplePattern = “sample”, :
    intron ids were either not the same or not in the same order across samples. double check i_data.ctab for each sample.
    In addition: Warning messages:
    1: In x$i_id != intronAll[[1]]$i_id :
    longer object length is not a multiple of shorter object length
    2: In x$i_id != intronAll[[1]]$i_id :
    longer object length is not a multiple of shorter object length

    回复

发表评论

电子邮件地址不会被公开。 必填项已用*标注

This site uses Akismet to reduce spam. Learn how your comment data is processed.