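Scraping dynamic web pages with Python

In bioinformatics, DAVID (the Database for Annotation, Visualization and Integrated Discovery) is a widely used annotation tool. It can annotate genes with GO terms, KEGG pathways, and more, and it provides an API for batch annotation.

DAVID's website is https://david.ncifcrf.gov/ and the API is described at https://david.ncifcrf.gov/content.jsp?file=DAVID_API.html

Readers outside bioinformatics can focus on the overall approach. Taking the gene with Entrez gene ID 1002 as an example, the API request that returns GO, InterPro, and other annotations has the following form: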

http://david.abcc.ncifcrf.gov/api.jsp?type=ENTREZ_GENE_ID&ids=1002,&tool=annotationReport&annot=GOTERM_BP_FAT,GOTERM_CC_FAT,GOTERM_MF_FAT,INTERPRO,PIR_SUPERFAMILY,SMART,BBID,BIOCARTA,KEGG_PATHWAY,COG_ONTOLOGY,SP_PIR_KEYWORDS,UP_SEQ_FEATURE,GENETIC_ASSOCIATION_DB_DISEASE,OMIM_DISEASE
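
A request URL of this shape can also be assembled from its parts instead of being typed by hand. The following is a minimal sketch, not DAVID's official client. It uses `urllib.request`, the Python 3 counterpart of `urllib2`, and the helper names `build_david_url` and `fetch` are made up for illustration. Only three of the annotation categories are passed, to keep the example short. The sketch does not check whether the returned HTML already contains the annotation table.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base endpoint copied from the example URL above.
DAVID_API = "http://david.abcc.ncifcrf.gov/api.jsp"


def build_david_url(ids, annotations, id_type="ENTREZ_GENE_ID",
                    tool="annotationReport"):
    """Assemble a DAVID API URL from gene IDs and annotation categories."""
    params = [
        ("type", id_type),
        # The example URL ends the ID list with a trailing comma; keep that shape.
        ("ids", ",".join(str(i) for i in ids) + ","),
        ("tool", tool),
        ("annot", ",".join(annotations)),
    ]
    # safe="," keeps the commas readable instead of encoding them as %2C.
    return DAVID_API + "?" + urlencode(params, safe=",")


def fetch(url, timeout=30):
    """Download a page and return its body as text (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


url = build_david_url([1002], ["GOTERM_BP_FAT", "INTERPRO", "KEGG_PATHWAY"])
print(url)
# html = fetch(url)  # uncomment to actually send the request
```

Building the URL from a list makes it easy to loop over many gene IDs or to change the annotation categories without editing a long string.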

Fetching the page above with Python's urllib2 package gives:

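Falcon, a genome assembly tool for third-generation sequencing

Workflow of the genome assembly tool Falcon

1 Introduction to Falcon

Falcon (Fast Alignment and CONsensus) is a de novo diploid genome assembler newly developed by PacBio (Pacific Biosciences). It extends HGAP (Hierarchical Genome Assembly Process) but assembles more efficiently. To run, Falcon needs three modules: DAZZ_DB to build the sequence database, DALIGNER to align sequences and find the overlaps between them, and pypeFLOW to record and track the progress of the workflow.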

Update the Linux RHEL yum source to the 163 mirror

The 163 open source mirror: http://mirrors.163.com/

As the root user:

cd /etc/yum.repos.d
wget http://mirrors.163.com/.help/CentOS5-Base-163.repo
yum clean all
yum makecache

But an error occurs:

> Timeout on http://mirrors.163.com/centos/6Server/os/x86_64/repodata/repomd.xml
> PYCURL ERROR 22 - The requested URL returned error: 404

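I can’t find this directory on 163’s web server, although http://mirrors.163.com/centos/6/os/x86_64/repodata/repomd.xml does exist. The failing URL contains 6Server where the mirror has 6, so it looks as if $releasever expands to 6Server on this RHEL machine, and the .repo file needs adjusting.

To fix it, I changed every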

baseurl=http://mirrors.163.com/centos/$releasever

to

baseurl=http://mirrors.163.com/centos/6
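
The same substitution can be scripted rather than made by hand in an editor. This is a minimal sketch under the assumption that only `baseurl=` lines should change. The function name `pin_release` and the sample text are made up for illustration and are not the contents of the real .repo file.

```python
def pin_release(repo_text, release="6"):
    """Replace $releasever with a fixed release number on baseurl lines only."""
    pinned = []
    for line in repo_text.splitlines(keepends=True):
        if line.lstrip().startswith("baseurl="):
            line = line.replace("$releasever", release)
        pinned.append(line)
    return "".join(pinned)


# Illustrative sample, not the real file.
sample = (
    "[base]\n"
    "name=CentOS-$releasever - Base\n"
    "baseurl=http://mirrors.163.com/centos/$releasever\n"
)
print(pin_release(sample))
```

To apply it for real, read the downloaded .repo file in /etc/yum.repos.d as root, write the result back, then rerun `yum clean all` and `yum makecache`.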