基因组坐标转换工具-以BED文件为例，从hg19转换到hg38坐标

2018-02-11

分析时使用的基因组版本，可能会与其他来源数据所使用的基因组版本不一致，需要统一成同一个版本的坐标，才能方便下一步的分析。

常用的有NCBI的Remap在线服务和UCSC的liftover，其实还有很多，本文暂时总结部分工具的用法。以将APOA1的编码区坐标（利用UCSC的genome browser下载，或者下载该文件APOA1.bed）转换为例，从hg19转到hg38版本坐标上。需要注意的是，在使用的时候，需要注意是否支持对应的格式。

	类型	支持格式	地址	推荐指数
Liftover	在线	bed	http://genome.ucsc.edu/cgi-bin/hgLiftOver	一般
Liftover	本地	bed和gff	http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/liftOver	推荐
Remap	在线	hgvs，bed，gvf，gff，gtf，Text ASN.1，Binary ASN.1，UCSC Region和VCF	https://www.ncbi.nlm.nih.gov/genome/tools/remap	推荐
CrossMap	本地	SAM/BAM,，Wiggle/BigWig， bed， gff/gtf，VCF	http://crossmap.sourceforge.net/	推荐
picard	本地	interval和VCF	http://broadinstitute.github.io/picard/

hg19、GRCH37、b37、hs37d5介绍和区别

2018-02-09

Default Category

大家经常用UCSC的hg19和NCBI的GRCh37版本的，但还有其他的版本，比如b37，hg37d5，比如在分析NIST的genome in a bottle（GIAB）提供的bam数据时，就遇到了hg37d5的版本，在用GATK的时候会遇到b37版本。

GRCh37

Genome Reference Consortium(基因组参照序列联盟)，由英国Wellcome Trust Sanger研究中心（the Wellcome Trust Sanger Center）、华盛顿大学基因组中心（The Washington University Genome Center）、欧洲生物信息研究所（the European Bioinformatics Institute）和美国国家生物技术信息中心（NCBI）联合组成。

GRCH37版本发布之后，也会有小的更新，比如GRCh37.p2，大的更新比如由GRCh37升级到GRCh38，填补gap，修改部分序列，其目的是提供一个完整的基因组序列assemble。GRCh38已经在2013年发布，多数基因组数据库正在兼容或者更新到该版本。

该版本包含人类chr1到chr22，chrX，chrY，MT染色体以及

“unlocalized sequences”：知道来自哪条染色体但不知道具体位置的序列
“unplaced sequences”：知道来自人类基因组序列，但不知道与染色体的关系
“alternate loci”：来自基因组特定区域，代表该区域序列的多样性 “1” to “22”, “X”, “Y” and “MT"命名比较规范，ENSEMBL， genome browser， the NCBI dbSNP (in VCF files), the Sanger COSMIC (in VCF files),都依照该规范。

下载地址：ftp://ftp.ncbi.nih.gov/genomes/Homo_sapiens

BCL文件与BCL2FAFSTQ程序简介

2018-02-07

Default Category

BCL文件

测序产生的原始文件是BCL（binary base call）文件，测序仪在测序的时候，每个cycle都会测量编码不同颜色的通道强度，并确定最有可能的碱基类型。Real Time Analysis (RTA) 软件会将碱基类型和可信度（一个质量分数）。与FASTQ文件不同的是，BCL文件是实时产生，每个cycle的每个tile都会有一个对应文件，文件放在

<run directory>/Data/Intensities/BaseCalls/L<lane>/C<cycle>.1

文件的命名

s_<lane>_<tile>.bcl

PLOT | 目标区域的测序覆盖度作图

2018-02-01

Default Category

不论是目标区域测序还是外显子组测序还是全基因组测序之后，我们会关注目标区域在特定深度下的覆盖度。比如20X的比例，50X的比例。如果只计算特定深度的覆盖度，我们了解不到其他深度下的覆盖度情况。如果每个都列出来，又不直观，这个时候用图片来表示就非常直观了。

获取统计数据

我们需要知道在每个深度下碱基的比例，这个时候强大的bedtools就出场了。

bedtools coverage -sorted -hist -g genome.file  -b  samp.bam -a target_regions.bed ' grep ^all > samp.hist.all.txt

-hist是为了获取目标区域的总结信息，以all开头，输出每个每个深度下碱基的比例，所以后续用grep ^all来过滤。

-sorted可以节省内存，但需要通过-g跟genome file来指定染色体的顺序。

genome file可以通过awk -v OFS='\t' {‘print $1,$2’} genome.fa.fai > genome.file 得到。

假设我们得到了smp1.hist.all.txt、smp2.hist.all.txt、smp3.hist.all.txt、smp4.hist.all.txt、smp5.hist.all.txt、smp6.hist.all.txt、smp7.hist.all.txt、

作图

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40


# Get a list of the bedtools output files you'd like to read in

print(labs <- paste("", gsub(".hist.all.txt", "", files, perl=TRUE), sep=""))


# Create lists to hold coverage and cumulative coverage for each alignment,
# and read the data into these lists.
cov <- list()
cov_cumul <- list()
for (i in 1:length(files)) {
    cov[[i]] <- read.table(files[i])
    # The value should be 1 at 0X.
    cov_cumul[[i]] <- 1-cumsum(c(0,cov[[i]][,5]))
}

library(RColorBrewer)
cols <- brewer.pal(length(cov), "Dark2")

# Save the graph to a file
png("target-coverage-plots.png", h=1000, w=1000, pointsize=20)

# Create plot area, but do not plot anything. Add gridlines and axis labels.
plot(cov[[1]][2:401, 2], cov_cumul[[1]][1:400], type='n', xlab="Depth", ylab="Fraction of capture target bases \u2265 depth", ylim=c(0,1.0), main="Target Region Coverage")
abline(v = 20, col = "gray60")
abline(v = 50, col = "gray60")
abline(v = 80, col = "gray60")
abline(v = 100, col = "gray60")
abline(h = 0.50, col = "gray60")
abline(h = 0.90, col = "gray60")
axis(1, at=c(20,50,80), labels=c(20,50,80))
axis(2, at=c(0.90), labels=c(0.90))
axis(2, at=c(0.50), labels=c(0.50))

# Actually plot the data for each of the alignments (stored in the lists).
for (i in 1:length(cov)) points(cov[[i]][2:401, 2], cov_cumul[[i]][1:400], type='l', lwd=3, col=cols[i])

# Add a legend using the nice sample labeles rather than the full filenames.
legend("topright", legend=labs, col=cols, lty=1, lwd=4)

dev.off()

Illumina下机FASTQ文件命名规则

2018-02-01

Default Category

FASTQ文件在Illumina下机数据文件夹Data\Intensities\BaseCalls**中，类似SampleName_S1_L001_R1_001.fastq.gz（比如NTC_S11_L001_R1_001.fastq.gz）**

其中被下划线_分为了五个部分。

第一部分：SampleName，样本名，与上机时在Sample Sheet中填写的一致
第二部分：S1，S***，S后跟的数字与样本在Sample Sheet中的顺序一致，从1开始。不能分配到确定样本的read会归到S0（Undetermined_S0）
第三部分：L00*，泳道lane的编号
第四部分：R*，R1表示read1，R2表示read2。R1和R2为paired end reads。同一个样本的配对的FASTQ，只有这个地方不同
第五部分：001，通常为001

Naming

FASTQ files are named with the sample name and the sample number, which is a numeric assignment based on the order that the sample is listed in the sample sheet.

Example:

Data\Intensities\BaseCalls\SampleName_S1_L001_R1_001.fastq.gz

• SampleName-The sample name provided in the sample sheet. If a sample name is not provided, the file name includes the sample ID, which is a required field in the sample sheet and must be unique.

• S1-The sample number based on the order that samples are listed in the sample sheet starting with 1. In this example, S1 indicates that this sample is the first sample listed in the sample sheet.