注释有很多版本,比如ensembl,gencode, ucsc known gene, NCBI的RefSeqGene。最近就需要NM id的注释,但NCBI提供的是gff3格式的,而且很乱。用UCSC table browser下载的gtf版本的RefSeq,没有转录本和基因之间的关系,也没有基因symbol。
比如Ensembl,其实Ensembl的gtf挺好用的,不过这次我因为需要NM编号的注释(笨方法是将ensembl id转成NCBI的refSeq的ID,但这不是最优的方法,ID mapping有可能对不上,不如直接用NM的注释)。
1
2
3
|
chr1 ensembl_havana gene 11869 14412 . + . gene_id "ENSG00000223972"; gene_version "4"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene";
chr1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "4"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; tag "basic";
chr1 havana exon 11869 12227 . + . gene_id "ENSG00000223972"; gene_version "4"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "ensembl_havana"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic";
|
UCSC table browser下载的refGene的gtf,这个文件不对的地方是gene id和transcript id是一个,而我需要gene和transcript的关系
1
2
3
4
5
6
|
chr1 hg19_refGene exon 66999276 66999355 0.000000 + . gene_id "NM_001308203"; transcript_id "NM_001308203";
chr1 hg19_refGene start_codon 67000042 67000044 0.000000 + . gene_id "NM_001308203"; transcript_id "NM_001308203";
chr1 hg19_refGene CDS 67000042 67000051 0.000000 + 0 gene_id "NM_001308203"; transcript_id "NM_001308203";
chr1 hg19_refGene exon 66999929 67000051 0.000000 + . gene_id "NM_001308203"; transcript_id "NM_001308203";
chr1 hg19_refGene CDS 67091530 67091593 0.000000 + 2 gene_id "NM_001308203"; transcript_id "NM_001308203";
chr1 hg19_refGene exon 67091530 67091593 0.000000 + . gene_id "NM_001308203"; transcript_id "NM_001308203";
|