Tag archive: UCSC

Download the hg19 genome files in one step

hg19 corresponds to GRCh37, and UCSC offers the hg19 reference genome for download. The UCSC download address is ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/

With this route you download each chromosome separately, then decompress the files and concatenate them into a single whole-genome file.
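
For reference, the per-chromosome route could be sketched roughly as follows. The chromosome list (chr1 to chr22, then X, Y, M), the default output name hg19.fa, and the helper names chrom_order and build_genome are assumptions made for illustration; the chromosomes directory also holds other sequences that this sketch skips. Spelling out the order explicitly keeps the chromosomes in 1, 2, 3 order instead of the 1, 10, 11 order a plain filename sort would give.

```shell
# Print the primary hg19 chromosome names in natural order.
# The list itself is an assumption; adjust it to the sequences you need.
chrom_order() {
    local c
    for c in $(seq 1 22) X Y M; do
        echo "chr${c}"
    done
}

# Download each chromosome, then append it to one FASTA file in that order.
# build_genome and the default output name hg19.fa are hypothetical names.
build_genome() {
    local base="ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes"
    local out="${1:-hg19.fa}"
    local chr
    : > "$out"                                # start from an empty output file
    for chr in $(chrom_order); do
        wget -c "${base}/${chr}.fa.gz" || return 1
        gunzip -c "${chr}.fa.gz" >> "$out"    # keep the .gz, append the sequence
    done
}

# Needs network access and a few GB of disk, so the call is left commented out:
# build_genome hg19.fa
```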

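This takes a while, and when merging you also have to make sure the chromosomes are ordered 1, 2, 3 rather than 1, 10, 11. The simplest way I know of is to download from the GATK bundle, for example the whole-genome hg19 files. The command below does it in one step and fetches the fasta, fai, and dict files.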

wget -c "ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/ucsc.hg19*"

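The GATK bundle also provides some other files worth a look (ftp://ftp.broadinstitute.org/bundle), such as dict files and hg38 files. Of course, building a reference genome does not require merging the chromosomes; that is just my personal habit.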
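
Once the bundle files are on disk, one quick way to confirm the chromosome order is to read the first column of the .fai index, which lists the sequence names in the order they are stored in the FASTA. A minimal sketch, assuming the downloaded files are gzip-compressed and decompress to ucsc.hg19.fasta.fai (the function name fai_order is made up here):

```shell
# Print the first N sequence names from a FASTA index (.fai).
# Column 1 of a .fai file is the sequence name, in FASTA file order.
fai_order() {
    local fai="$1"
    local n="${2:-5}"
    cut -f1 "$fai" | head -n "$n"
}

# Assumed usage after the download (file names are assumptions):
# gunzip ucsc.hg19.fasta.fai.gz
# fai_order ucsc.hg19.fasta.fai 25
```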

Addendum: UCSC also provides a download of the whole genome in one archive: ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz

#####################################################################
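#All rights reserved. Please notify the author before reposting. Copyright belongs to the author. Infringement, once discovered, will be pursued under the law.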
#Author: Jason
#####################################################################

Get the reference allele based on genomic position

This post shows how to get the reference bases of chr1 from 49999 to 50001 (version: hg19).
Please note: different tools use different coordinate systems (0-based or 1-based start).

1, SAMtools

Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create .fai on the disk. If regions are specified, the subsequences will be retrieved and printed to stdout in the FASTA format.

$samtools faidx ucsc.hg19.fasta chr1:49999-50001
>chr1:49999-50001
ATA

2, twoBitToFa

$wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/faToTwoBit
$wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/twoBitToFa
$chmod 755 twoBitToFa faToTwoBit 
$faToTwoBit ucsc.hg19.fasta ucsc.hg19.2bit
$twoBitToFa ucsc.hg19.2bit -seq=chr1 -start=49998 -end=50001 temp.out && cat temp.out && rm temp.out
>chr1:49998-50001
ATA
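
Note the start value: twoBitToFa takes a zero-based start and a non-inclusive end, while the samtools region in section 1 is 1-based and inclusive, so only the start shifts by one. A small sketch of that conversion (the helper name region_to_twobit_args is hypothetical):

```shell
# Convert a 1-based inclusive region "chr:start-end" (samtools style)
# into the matching twoBitToFa options (0-based start, end not included).
region_to_twobit_args() {
    local region="$1"
    local chrom="${region%%:*}"    # text before the colon
    local range="${region#*:}"     # text after the colon
    local start="${range%-*}"
    local end="${range#*-}"
    echo "-seq=${chrom} -start=$((start - 1)) -end=${end}"
}

region_to_twobit_args chr1:49999-50001
# prints: -seq=chr1 -start=49998 -end=50001
```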

3, UCSC DAS server

$wget -qO- 'http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr1:49999,50001' | grep -v '<'
ata
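
The three tools print the same bases with different headers and letter case (ATA versus ata). To compare them directly, the output can be reduced to a bare uppercase sequence. A minimal sketch; normalize_seq is a made-up helper, and it assumes FASTA header lines start with '>' and XML lines contain '<':

```shell
# Read tool output on stdin and print only the sequence, in uppercase.
# Drops FASTA headers ('>' at line start) and any line containing '<' (XML tags).
normalize_seq() {
    grep -v -e '^>' -e '<' | tr -d ' \r\n' | tr '[:lower:]' '[:upper:]'
    echo    # finish with a single newline
}

printf '>chr1:49999-50001\nATA\n' | normalize_seq   # prints: ATA
printf 'ata\n' | normalize_seq                      # prints: ATA
```

Piping each command's output through the same filter makes it easy to check that all three return identical bases.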
