PyCharm fails at startup with "failed to create JVM: error code 4"

2016-01-18

PyCharm fails at launch with "failed to create JVM: error code 4". The usual fix is to decrease the -Xmx and -XX option values in $IDE_HOME\bin\idea.exe.vmoptions. Alternatively, note that when PyCharm invokes the JVM it relies on its own VM options, read from its VM options configuration file. These can be set in JetBrains\PyCharm Community Edition 3.4.1\bin: that directory contains a file named pycharm.exe.vmoptions. You can decrease the -XX and

Converting between CSV and tab-delimited files, with pipe support

2016-01-15

Default Category

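Annovar's annotation results look messy when written as tab-delimited VCF. Writing them as CSV lets Windows users open the file in Excel, but CSV is awkward for data processing: for example, special care is needed when the annotation in a column itself contains commas. Python's csv module reads CSV conveniently and is the recommended tool.

The small script in this post handles simple CSV files. Its main feature is that it reads from standard input and writes to standard output, so it can be chained with other bioinformatics commands through pipes. If no input file is given, it reads the piped data stream; if no output file is given, the result can be piped on to the next processing step.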

For example:
cat xxx.csv | python convert.py | grep "xx" > result.txt
or
python convert.py -i input.csv | grep "xx" > result.txt
or
python convert.py -i input.csv -o result.txt
To see the usage:
python convert.py --help
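The original convert.py is not shown here, so the following is only a minimal sketch of how such a pipe-friendly converter could look. The `-i` and `-o` options come from the commands above; the function names and the `--reverse` switch are assumptions made for illustration.

```python
import argparse
import csv
import sys


def convert(src, dst, in_delim=",", out_delim="\t"):
    """Copy delimited rows from src to dst, changing only the delimiter."""
    reader = csv.reader(src, delimiter=in_delim)
    writer = csv.writer(dst, delimiter=out_delim, lineterminator="\n")
    for row in reader:
        writer.writerow(row)


def main(argv=None):
    parser = argparse.ArgumentParser(description="Convert CSV to tab-delimited text.")
    parser.add_argument("-i", "--input", help="input file (default: standard input)")
    parser.add_argument("-o", "--output", help="output file (default: standard output)")
    # Hypothetical switch, not taken from the original script.
    parser.add_argument("-r", "--reverse", action="store_true",
                        help="convert tab-delimited text to CSV instead")
    args = parser.parse_args(argv)

    in_delim, out_delim = ("\t", ",") if args.reverse else (",", "\t")
    # Fall back to the pipe when no file name is given.
    src = open(args.input, newline="") if args.input else sys.stdin
    dst = open(args.output, "w", newline="") if args.output else sys.stdout
    try:
        convert(src, dst, in_delim, out_delim)
    finally:
        if args.input:
            src.close()
        if args.output:
            dst.close()
```

Calling `main()` at the bottom of the script would give the command-line behaviour. Because the csv module does the parsing, a quoted field that contains a comma survives the conversion as a single column.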

Counting the reads that cover a region and the GC content of those reads

2016-01-12

Default Category

Compute GC content per interval. A BED file holds the target interval information. bedtools counts the reads in each interval and extracts their sequences, and awk computes the GC content.

bedtools map -a interval.bed -b sample.bam -c 10,10 -o count,concat | awk -v OFS="\t" '{n=length($5); gc=gsub("[gcGC]", "", $5); print $1,$2,$3,$4,gc/n}'

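Approach: use the bedtools map tool to first collect the sequences of the reads that map to each interval in interval.bed, then count how many G and C bases those sequences contain.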
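The GC step can also be sketched in Python. This is an illustrative equivalent, not the author's code: the function name is hypothetical, and it assumes the joined read sequences arrive as one text field. Unlike the awk one-liner, which divides by the full field length, this sketch counts only nucleotide letters in the denominator, so any separator characters in the field are ignored.

```python
def gc_fraction(field):
    """Fraction of G/C among the nucleotide letters of a joined sequence field."""
    # Keep only nucleotide letters; separators or placeholders are dropped.
    bases = [ch for ch in field.upper() if ch in "ACGTN"]
    if not bases:
        return 0.0  # no sequence for this interval
    gc = sum(ch in "GC" for ch in bases)
    return gc / len(bases)
```

Applied to the fifth column of each output line, this gives one GC fraction per interval.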

Introduction to SIFT and PolyPhen2, databases for predicting how damaging a mutation is

2016-01-12

Default Category

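About the SIFT score:

SIFT sorts intolerant from tolerant amino acid substitutions: it searches for similar sequences, aligns them, and computes the probability of each substitution. Substitutions with a probability below 0.05 are considered deleterious.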

SIFT takes a query sequence and uses multiple alignment information to predict tolerated and deleterious substitutions for every position of the query sequence. SIFT is a multistep procedure that (1) searches for similar sequences, (2) chooses closely related sequences that may share similar function to the query sequence, (3) obtains the alignment of these chosen sequences, and (4) calculates normalized probabilities for all possible substitutions from the alignment. Positions with normalized probabilities less than 0.05 are predicted to be deleterious; those greater than or equal to 0.05 are predicted to be tolerated.
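The cutoff rule in the last sentence can be written down directly. The sketch below only applies the 0.05 threshold to a score that has already been computed; the function and constant names are hypothetical and it does not reproduce any part of the SIFT algorithm itself.

```python
SIFT_CUTOFF = 0.05  # threshold quoted in the description above


def classify_sift(score, cutoff=SIFT_CUTOFF):
    """Label a normalized probability as deleterious or tolerated."""
    # Strictly below the cutoff is deleterious; the cutoff itself is tolerated.
    return "deleterious" if score < cutoff else "tolerated"
```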

SAMtools' built-in statistics commands: idxstats, stats, flagstat, bedcov and depth

2016-01-11

Default Category

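SAMtools is not only for calling SNPs. As the name suggests, it is a tool for working with SAM format files: converting SAM to BAM, index, rmdup and so on. Combined with Linux commands such as grep and awk, and with the flags and tags described by the SAM format, samtools becomes very powerful. For example, duplicate reads can be filtered by flag, and multiple-hit reads can be filtered by the XA tag. This post covers only the samtools statistics commands, which quickly produce a range of statistics for a BAM file.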
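As an illustration of the flag and tag filtering mentioned above, here is a minimal Python sketch that works on one SAM text line at a time. The function name is hypothetical. It relies on two facts from the SAM format: bit 0x400 of the FLAG column marks a duplicate, and optional tags start at column 12. The XA tag is written by BWA for alternative hits, so the second check assumes a BWA alignment.

```python
DUPLICATE = 0x400  # SAM flag bit: PCR or optical duplicate


def keep_record(sam_line):
    """True if the alignment is not a duplicate and carries no XA tag."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & DUPLICATE:
        return False
    # Optional TAG:TYPE:VALUE fields follow the 11 mandatory columns.
    return not any(tag.startswith("XA:Z:") for tag in fields[11:])
```

Header lines (those starting with `@`) would need to be passed through separately before calling this function.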

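The statistics tools built into samtools

**samtools idxstats**

Retrieves and prints the statistics stored in the index file that corresponds to the input file, so the input BAM file must be indexed first.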

The output columns are reference sequence name, sequence length, # mapped reads and # unmapped reads:
chr1 249250621 4998344 1005
chr2 243199373 3020248 595
chr3 198022430 2418804 449
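Since every row has the same four columns, totals are easy to derive. The sketch below is a hypothetical helper that sums the mapped and unmapped counts over all reference sequences; it assumes the idxstats output has been captured as text.

```python
def summarize_idxstats(text):
    """Sum mapped and unmapped read counts over all reference sequences."""
    mapped = unmapped = 0
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        mapped += int(fields[2])
        unmapped += int(fields[3])
    return mapped, unmapped
```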

**samtools bedcov**

Computes the total number of bases covering each region.

chr start-1 end totalbase
chr1 100000 1000000 1709228
chr2 2000000 65885852 64362582
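Dividing the base total by the region length gives the mean depth of that region. The helper below is a hypothetical sketch that assumes a three-column BED input, so the base total is the fourth column and the start is already 0-based as in the header above.

```python
def region_mean_depth(bedcov_line):
    """Return (chrom, mean depth) for one line of bedcov output."""
    chrom, start, end, total = bedcov_line.split()[:4]
    length = int(end) - int(start)  # BED intervals are half-open
    return chrom, int(total) / length
```

For the first row above, this divides 1709228 bases by a region of 900000 positions.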

**samtools depth**

Computes the depth at each position.

#chr pos depth
chr1 1 5
chr1 2 5
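The per-position rows can be aggregated into a mean depth per chromosome. This is a minimal sketch with a hypothetical function name. It averages only over the positions present in the input, so positions that samtools depth did not report are not counted as zero.

```python
from collections import defaultdict


def mean_depth_by_chrom(lines):
    """Average the depth column per chromosome over the reported positions."""
    total = defaultdict(int)
    count = defaultdict(int)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header comment
        chrom, _pos, depth = line.split()[:3]
        total[chrom] += int(depth)
        count[chrom] += 1
    return {chrom: total[chrom] / count[chrom] for chrom in total}
```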

**samtools flagstat**

Reports counts derived from the flags, such as how many reads are mapped:
43444444 + 0 in total (QC-passed reads + QC-failed reads)
5863846 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
43431948 + 0 mapped (99.97%:-nan%)
37580598 + 0 paired in sequencing
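Each flagstat line has the shape "QC-passed + QC-failed label", which makes it straightforward to parse. The following is a minimal sketch with hypothetical function names; it assumes the flagstat output has been captured as text in the format shown above.

```python
import re

LINE_RE = re.compile(r"(\d+) \+ (\d+) (.+)")


def parse_flagstat(text):
    """Return a list of (qc_passed, qc_failed, label) tuples."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            rows.append((int(match.group(1)), int(match.group(2)), match.group(3)))
    return rows


def passed_count(rows, label):
    """QC-passed count of the first row whose label is `label`,
    optionally followed by a parenthesised note."""
    for passed, _failed, text in rows:
        if text == label or text.startswith(label + " ("):
            return passed
    raise KeyError(label)
```

With these helpers, `passed_count(rows, "mapped") / passed_count(rows, "in total")` gives the mapped fraction of QC-passed reads.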