Tag archive: Python

Parse gtf

I often work with GTF files to retrieve gene information, but I could not find a tool flexible enough for my needs. I modified the code from “https://github.com/Jverma/GFF-Parser” (thanks, Jverma) to make it easier to use.

/wp/f4w/2020/2020-10-02-gtf-parser.gif

Usage

Basically, there are three parameters.

id: either transcript id or gene id.

attType: an attribute defined in the gtf file, e.g. feature (column 3), or gene_name, exon_number, transcript_id (column 9).

attValue: the attribute value you want to search for.

>>> import sys
>>> from gtfParser import gtfParser

>>> gtf = gtfParser("example.gtf")

>>> # Get all exons in CDK7
>>> gtf.getRecordsByID("CDK7", "feature", "exon")

>>> # Get all records whose transcript_id is "NM_001324069" in gene "CDK7"
>>> gtf.getRecordsByID("CDK7", "transcript_id", "NM_001324069")

>>> # Get the start codon (feature "start_codon") of transcript "NM_001324069"
>>> gtf.getRecordsByID("NM_001324069", "feature", "start_codon")

>>> # Get the exon whose exon_id is "NM_001324078.1" in transcript "NM_001324078"
>>> gtf.getRecordsByID("NM_001324078", "exon_id", "NM_001324078.1")
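
To make the three parameters more concrete, here is a minimal sketch of how such a lookup could work. It is not the code of gtfParser: the names `parse_gtf_line` and `select` are hypothetical, the sample lines use made-up coordinates, and it assumes every column-9 attribute is written as `key "value";`.

```python
import re

# Matches column-9 pairs such as: gene_id "CDK7";
# Assumption: values are always double-quoted.
ATTR_RE = re.compile(r'(\w+)\s+"([^"]*)"')


def parse_gtf_line(line):
    """Turn one GTF line into a dict of fixed columns plus column-9 attributes."""
    cols = line.rstrip("\n").split("\t")
    record = {
        "seqname": cols[0],
        "source": cols[1],
        "feature": cols[2],  # column 3
        "start": int(cols[3]),
        "end": int(cols[4]),
        "strand": cols[6],
    }
    record.update(ATTR_RE.findall(cols[8]))  # column 9
    return record


def select(records, id_, att_type, att_value):
    """Keep records of the given gene or transcript where att_type == att_value."""
    return [
        r for r in records
        if id_ in (r.get("gene_id"), r.get("transcript_id"))
        and r.get(att_type) == att_value
    ]


# Toy lines for illustration only; the coordinates are not real annotation data.
SAMPLE = [
    'chr5\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "CDK7"; transcript_id "NM_001324069"; exon_number "1";',
    'chr5\ttoy\tstart_codon\t150\t152\t.\t+\t0\tgene_id "CDK7"; transcript_id "NM_001324069";',
    'chr5\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "CDK7"; transcript_id "NM_001324078"; exon_number "1";',
]

records = [parse_gtf_line(line) for line in SAMPLE]
for rec in select(records, "CDK7", "feature", "exon"):
    print(rec["transcript_id"], rec["feature"], rec["start"], rec["end"])
```

Storing the column-3 feature under the same dict as the column-9 attributes is what lets a single attType argument address both.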

# Example gtf
Here is a simple example gtf file that you can use for testing. It is a subset of refSeq.hg38.gtf.

Continue reading

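Converting between CSV and tab-delimited files, with pipe support

Annovar's annotation results look cluttered when written as tab-delimited VCF. Writing them as CSV makes it easy for Windows users to open the file in Excel, but is less convenient for data processing: take particular care when, for example, the annotation in a column itself contains commas. Python's csv module reads CSV files conveniently and is recommended.

The small script in this post mainly handles simple CSV files. Its highlight is that it reads from standard input and writes to standard output, so it can be chained with other bioinformatics commands through pipes. If no input file is given, it reads the data from the pipe; if no output file is given, the output can be piped into the next processing step.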

For example: cat xxx.csv | python convert.py | grep "xx" > result.txt
or: python convert.py -i input.csv | grep "xx" > result.txt
or: python convert.py -i input.csv -o result.txt
To see the usage: python convert.py --help
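
The stdin/stdout fallback described above could be sketched roughly as follows. This is not the post's convert.py: only the -i, -o and --help options appear in the commands above, so the -r/--reverse flag and the function names here are assumptions made for illustration.

```python
import argparse
import csv
import sys


def convert(src, dst, in_delim=",", out_delim="\t"):
    """Copy rows from src to dst with a new delimiter.

    The csv module takes care of quoted fields that contain commas.
    """
    writer = csv.writer(dst, delimiter=out_delim, lineterminator="\n")
    for row in csv.reader(src, delimiter=in_delim):
        writer.writerow(row)


def main(argv=None):
    parser = argparse.ArgumentParser(description="Convert CSV to tab-delimited text.")
    parser.add_argument("-i", "--input", help="input file (default: standard input)")
    parser.add_argument("-o", "--output", help="output file (default: standard output)")
    # Hypothetical flag, not taken from the original script.
    parser.add_argument("-r", "--reverse", action="store_true",
                        help="convert tab-delimited text to CSV instead")
    args = parser.parse_args(argv)

    in_delim, out_delim = ("\t", ",") if args.reverse else (",", "\t")
    # Fall back to the pipe when no file name is given.
    src = open(args.input, newline="") if args.input else sys.stdin
    dst = open(args.output, "w", newline="") if args.output else sys.stdout
    try:
        convert(src, dst, in_delim, out_delim)
    finally:
        if src is not sys.stdin:
            src.close()
        if dst is not sys.stdout:
            dst.close()

# To run this as a script, call main() under an `if __name__ == "__main__":` guard.
```

Keeping the conversion in a function that only sees file-like objects means the same code path serves files and pipes.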

Continue reading

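Scraping dynamic web pages with Python

In bioinformatics, DAVID (the Database for Annotation, Visualization and Integrated Discovery) is a commonly used annotation tool that can annotate genes with GO terms, KEGG pathways and more. DAVID provides an API for batch annotation.

DAVID's website is https://david.ncifcrf.gov/ and the API is described at https://david.ncifcrf.gov/content.jsp?file=DAVID_API.html

Readers outside bioinformatics can focus on the approach to the analysis.
Taking the gene with Entrez gene id 1002 as an example, the API URL that returns GO, InterPro and other annotations has the following format: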

http://david.abcc.ncifcrf.gov/api.jsp?type=ENTREZ_GENE_ID&ids=1002,&tool=annotationReport&annot=GOTERM_BP_FAT,GOTERM_CC_FAT,GOTERM_MF_FAT,INTERPRO,PIR_SUPERFAMILY,SMART,BBID,BIOCARTA,KEGG_PATHWAY,COG_ONTOLOGY,SP_PIR_KEYWORDS,UP_SEQ_FEATURE,GENETIC_ASSOCIATION_DB_DISEASE,OMIM_DISEASE
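
As a rough sketch, a URL of this shape can be assembled from its parts instead of being typed by hand. The helper names `build_david_url` and `fetch` are hypothetical, and the code uses Python 3's urllib (urllib2 exists only in Python 2). The host is the one shown in the URL above. Whether a plain HTTP fetch returns the final report for a dynamically rendered page is not something this sketch can promise.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "http://david.abcc.ncifcrf.gov/api.jsp"  # host as written in the post


def build_david_url(ids, annotations, id_type="ENTREZ_GENE_ID", tool="annotationReport"):
    """Assemble a query URL in the format shown above (hypothetical helper)."""
    query = urlencode(
        {
            "type": id_type,
            # The example URL ends the id list with a trailing comma.
            "ids": ",".join(str(i) for i in ids) + ",",
            "tool": tool,
            "annot": ",".join(annotations),
        },
        safe=",",  # keep commas readable instead of percent-encoding them
    )
    return API_BASE + "?" + query


def fetch(url, timeout=30):
    """Download the page body as text; needs network access, so it is not called here."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


url = build_david_url([1002], ["GOTERM_BP_FAT", "INTERPRO", "KEGG_PATHWAY"])
print(url)
```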

Fetching the page above with Python's urllib2 package gives the following result:


Continue reading