标签归档:gtf

统计GTF文件中转录本的长度 Calculate transcript length from gtf file

gtf 文件INPUT

chr1    PacBio  exon    763020  763155  .       +       .       gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "LINC01128"; oId "PB.5.1"; nearest_ref "NR_047519"; class_code "j"; tss_id "TSS1";
chr1    PacBio  exon    764383  764484  .       +       .       gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; gene_name "LINC01128"; oId "PB.5.1"; nearest_ref "NR_047519"; class_code "j"; tss_id "TSS1";
chr1    PacBio  exon    776580  776753  .       +       .       gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "3"; gene_name "LINC01128"; oId "PB.5.1"; nearest_ref "NR_047519"; class_code "j"; tss_id "TSS1";

Code

awk -F"\t" '
{
if ($3=="exon")
    {
        split($9, a, " ");
        ID=a[4]; 
        L[ID]+=$5-$4+1
    } 
}
END{
    for(i in L)
        {print i"\t"L[i]}
    }
' gtf-file >output

You can change array index and feature type to choose what you want to count. For example, if you want to calculate gene length, change exon to gene (make sure your gtf file has this feature).

输出OUTPUT

"TCONS_00052437";       5950
"TCONS_00049398";       988
"TCONS_00031225";       6005
"TCONS_00026369";       825

参考:https://gist.github.com/sp00nman/10372555

RefSeq的gtf文件

注释有很多版本,比如ensembl,gencode, ucsc known gene, NCBI的RefSeqGene。最近就需要NM id的注释,但NCBI提供的是gff3格式的,而且很乱。用UCSC table browser下载的gtf版本的RefSeq,没有转录本和基因之间的关系,也没有基因symbol。

比如Ensembl,其实Ensembl的gtf挺好用的,不过这次我因为需要NM编号的注释(笨方法是将ensembl id转成NCBI的refSeq的ID,但这不是最优的方法,ID mapping有可能对不上,不如直接用NM的注释)。

chr1    ensembl_havana  gene    11869   14412   .       +       .       gene_id "ENSG00000223972"; gene_version "4"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene";
chr1    havana  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972"; gene_version "4"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; tag "basic";
chr1    havana  exon    11869   12227   .       +       .       gene_id "ENSG00000223972"; gene_version "4"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic";

UCSC table browser下载的refGene的gtf,这个文件不对的地方是gene id和transcript id是一个,而我需要gene和transcript的关系

chr1    hg19_refGene    exon    66999276        66999355        0.000000        +       .       gene_id "NM_001308203"; transcript_id "NM_001308203";
chr1    hg19_refGene    start_codon     67000042        67000044        0.000000        +       .       gene_id "NM_001308203"; transcript_id "NM_001308203";
chr1    hg19_refGene    CDS     67000042        67000051        0.000000        +       0       gene_id "NM_001308203"; transcript_id "NM_001308203";
chr1    hg19_refGene    exon    66999929        67000051        0.000000        +       .       gene_id "NM_001308203"; transcript_id "NM_001308203";
chr1    hg19_refGene    CDS     67091530        67091593        0.000000        +       2       gene_id "NM_001308203"; transcript_id "NM_001308203";
chr1    hg19_refGene    exon    67091530        67091593        0.000000        +       .       gene_id "NM_001308203"; transcript_id "NM_001308203";

中间找了很多,终于找到一个方法,https://bioinformatics.stackexchange.com/questions/2548/hg38-gtf-file-with-refseq-annotations

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "select * from refGene" hg19 | cut -f2- | genePredToGtf -source=hg19.refGene.ucsc file stdin stdout > refSeq.hg19.gtf

genePredToGtf可以从UCSC下载,需要注意的直接下载refFlat的文件(不通过mysql),用genePhredToGtf生成的文件和UCSC table browser下载的gtf文件是一样的,通过mysql下载用genephreadToGtf是我想要的。

比如这样,这才是我想要的

chrX    hg19.refGene.ucsc       exon    70683674        70685855        .       +       .       gene_id "TAF1"; transcript_id "NM_138923"; exon_number "38"; exon_id "NM_138923.38"; gene_name "TAF1";
chrX    hg19.refGene.ucsc       CDS     70683674        70683893        .       +       1       gene_id "TAF1"; transcript_id "NM_138923"; exon_number "38"; exon_id "NM_138923.38"; gene_name "TAF1";
chrX    hg19.refGene.ucsc       start_codon     70586225        70586227        .       +       0       gene_id "TAF1"; transcript_id "NM_138923"; exon_number "1"; exon_id "NM_138923.1"; gene_name "TAF1";
chrX    hg19.refGene.ucsc       stop_codon      70683894        70683896        .       +       0       gene_id "TAF1"; transcript_id "NM_138923"; exon_number "38"; exon_id "NM_138923.38"; gene_name "TAF1";

Ref: https://bioinformatics.stackexchange.com/questions/2548/hg38-gtf-file-with-refseq-annotations
#####################################################################
#版权所有 转载请告知 版权归作者所有 如有侵权 一经发现 必将追究其法律责任
#Author: Jason
#####################################################################

Convert gtf to bed12 format — gtf2bed

Some software requires bed12 format, not gtf/gff. So a convertion work should be done.

bedops gtf2bed

Easy way to get your result —- bedops

gtf2bed < Homo_sapiens.GRCh38.85.gtf | head
1       11868   14409   ENST00000456328 0       +       11868   14409   0       3       359,109,1189,   0,744,1352,
1       12009   13670   ENST00000450305 0       +       12009   13670   0       6       48,49,85,78,154,218,    0,169,603,965,1211,1443,
1       17368   17436   ENST00000619216 0       -       17368   17436   0       1       68,     0,
1       14403   29570   ENST00000488147 0       -       14403   29570   0       11      98,34,152,159,198,136,137,147,99,154,37,        0,601,1392,2203,2454,2829,3202,3511,3864,10334,15130,
1       29553   31097   ENST00000473358 0       +       29553   31097   0       3       486,104,122,    0,1010,1422,
1       30266   31109   ENST00000469289 0       +       30266   31109   0       2       401,134,        0,709,
1       30365   30503   ENST00000607096 0       +       30365   30503   0       1       138,    0,
1       34553   36081   ENST00000417324 0       -       34553   36081   0       3       621,205,361,    0,723,1167,
1       35244   36073   ENST00000461467 0       -       35244   36073   0       2       237,353,        0,476,
1       52472   53312   ENST00000606857 0       +       52472   53312   0       1       840,    0,

Here we use UCSC’s package to convert gtf format file to bed12 format file.

Download UCSC package

wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred
wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/genePredToBed
chmod 755 gtfToGenePred genePredToBed

Convert gtf to bed12

time gtfToGenePred Homo_sapiens.GRCh38.85.gtf test.genePhred && genePredToBed test.genePhred result.bed && rm test.genePhred 

real    0m14.959s
user    0m14.387s
sys     0m0.567s
head reult.bed 
1       11868   14409   ENST00000456328 0       +       14409   14409   0       3       359,109,1189,   0,744,1352,
1       12009   13670   ENST00000450305 0       +       13670   13670   0       6       48,49,85,78,154,218,    0,169,603,965,1211,1443,
1       14403   29570   ENST00000488147 0       -       29570   29570   0       11      98,34,152,159,198,136,137,147,99,154,37,       0,601,1392,2203,2454,2829,3202,3511,3864,10334,15130,
1       17368   17436   ENST00000619216 0       -       17436   17436   0       1       68,     0,
1       29553   31097   ENST00000473358 0       +       31097   31097   0       3       486,104,122,    0,1010,1422,
1       30266   31109   ENST00000469289 0       +       31109   31109   0       2       401,134,        0,709,
1       30365   30503   ENST00000607096 0       +       30503   30503   0       1       138,    0,
1       34553   36081   ENST00000417324 0       -       36081   36081   0       3       621,205,361,    0,723,1167,
1       35244   36073   ENST00000461467 0       -       36073   36073   0       2       237,353,        0,476,
1       52472   53312   ENST00000606857 0       +       53312   53312   0       1       840,    0,

Cheers.

Another tool

I also recommend a perl script — gtf2bed. This script works fine to convert gtf to bed format. Thanks for ea-utils project.

Ref

https://groups.google.com/a/soe.ucsc.edu/forum/#!topic/genome/fTNuZOS-CYA

https://expressionanalysis.github.io/ea-utils/

bedops.readthedocs.io

#####################################################################
#版权所有 转载请告知 版权归作者所有 如有侵权 一经发现 必将追究其法律责任
#Author: Jason
####################################################################