
Continuous Long Reads, subreads and scraps in PacBio raw data

PacBio's tools really are updated fast: https://github.com/PacificBiosciences/pbbioconda

Data processed through the Iso-Seq pipeline used to be called high-quality reads; now they are called HiFi reads. PacBio's toolkit also tends to bundle in the tools used throughout Iso-Seq analysis, such as the downstream alignment and collapse steps. After tofu was discontinued, cDNA_Cupcake took its place. Below are explanations of several terms (CLR, CCS, subreads, scraps) that you will encounter when processing raw data.

The contiguous sequence generated by the polymerase during sequencing is referred to as a “polymerase read” or a Continuous Long Read (CLR).

This CLR read may include sequence from adapters and multiple copies of inserts, because it traverses the circular template many times. The CLRs are processed to remove adapter sequences and to retain only the insert sequence, called “subreads”.

All other sequences sequenced from the CLR are called “scraps”.

Multiple copies of subreads generated from a single SMRTbell template can then be collapsed into a single, high-quality sequence, called the “read of insert” (ROI) or Circular Consensus Sequence (CCS).

On the difference between the scraps and subreads BAM files: for analysis, the subreads file alone is sufficient.
The scraps file includes all the sequence not in the HQ (high quality) region, and the adapter sequences. The old bax format saved everything with indexes into the usable data. The subreads.bam contains all the usable data.

.subreads.bam: contains the usable reads, with adapters removed. The Sequel System outputs one subreads.bam file per collection/cell, containing unaligned base calls from high-quality regions.
.scraps.bam: contains sequence data outside of the high-quality region: rejected subreads, excised adapters, possible barcode sequences, and spike-in control sequences. (The basecaller marks regions of single-molecule sequencing activity as high-quality.)

Ref:
http://www.zxzyl.com/archives/1044

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03751-8

http://seqanswers.com/forums/showpost.php?p=202171&postcount=9

Some less familiar prediction scores used in variant interpretation

帅旸 on several less familiar prediction metrics used in variant interpretation:

z score

z score: this metric reflects how tolerant a gene is to missense variation. It is derived by comparing the number of missense variants expected for the gene with the number actually observed: the fewer missense variants observed relative to expectation, the higher the score and the less tolerant the gene. A z score > 3.09 is taken to mean the gene is missense-intolerant. The z score can be used when applying criterion PP2 of the ACMG guidelines.
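As a rough illustration, the core of the constraint z score can be sketched in Python. This is only the conceptual formula (the gap between expected and observed missense counts, normalized by the expectation); the published ExAC/gnomAD pipeline applies further per-gene corrections, so treat this as a sketch, not the exact implementation.

```python
from math import sqrt

def missense_z(expected: float, observed: float) -> float:
    # Core idea of the constraint z score: fewer observed missense
    # variants than expected gives a larger positive z (the gene is
    # less tolerant of missense variation).
    return (expected - observed) / sqrt(expected)

# A hypothetical gene: 100 missense variants expected, 40 observed.
z = missense_z(expected=100, observed=40)
print(round(z, 2))  # 6.0
print(z > 3.09)     # True: treat the gene as missense-intolerant
```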

REVEL score

REVEL score: the ClinGen SVI recommends REVEL for predicting missense pathogenicity. Unlike other commonly used missense predictors, REVEL integrates the predictions of 13 tools, including SIFT, PolyPhen, and GERP++, and performs especially well on rare variants. A REVEL score > 0.75 supports ACMG criterion PP3, and a score < 0.15 supports BP4.

GERP++

GERP++ “rejected substitutions” (RS) score: GERP++ predicts site conservation from the rate of evolution, specifically as the number of substitutions expected at the site minus the number observed. The larger the score, the more conserved the site; a GERP++ RS score > 6.8 is taken to mean the site is conserved. When analyzing a synonymous variant that does not affect splicing, an RS score < 6.8 allows ACMG criterion BP7 to be applied.
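The RS definition and the BP7 decision rule above can be sketched in a few lines of Python. The numbers in the demo calls are hypothetical, chosen only to show both outcomes.

```python
def gerp_rs(expected_subs: float, observed_subs: float) -> float:
    # RS = substitutions expected under neutral evolution minus
    # substitutions actually observed; larger = more conserved.
    return expected_subs - observed_subs

def bp7_applicable(rs: float, affects_splicing: bool) -> bool:
    # BP7 as described above: a synonymous variant with no splice
    # impact at a non-conserved site (RS < 6.8).
    return (not affects_splicing) and rs < 6.8

print(bp7_applicable(gerp_rs(4.0, 1.5), affects_splicing=False))  # True
print(bp7_applicable(gerp_rs(9.0, 0.5), affects_splicing=False))  # False
```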

dbscSNV score

dbscSNV score: dbscSNV provides two different algorithms for predicting whether a variant affects splicing, one based on adaptive boosting and one based on a random forest. When both scores are below 0.6, the variant is considered not to affect splicing.
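The two-score rule translates directly into a small helper. A minimal sketch of the decision logic described above (the function name and the example scores are made up for illustration):

```python
def predicted_splice_altering(ada_score: float, rf_score: float,
                              cutoff: float = 0.6) -> bool:
    # dbscSNV rule as described above: only when BOTH the
    # adaptive-boosting and random-forest scores fall below the
    # cutoff is the variant considered not to affect splicing.
    return not (ada_score < cutoff and rf_score < cutoff)

print(predicted_splice_altering(0.12, 0.30))  # False: no splice effect
print(predicted_splice_altering(0.95, 0.88))  # True
```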

References

ClinGen, and Zhang, J., Yao, Y., He, H., & Shen, J. (2020). Clinical Interpretation of Sequence Variants. Current Protocols in Human Genetics, 106(1), e98.

Other metrics such as pLI and the haploinsufficiency score were covered previously; see the link below:

http://www.zxzyl.com/archives/940

#####################################################################
#All rights reserved. Please notify the author before reposting; copyright remains with the author, and any infringement, once discovered, will be pursued under the law.
#Author: 帅旸
#####################################################################

Two recommended RStudio addins

datapasta

https://github.com/MilesMcBain/datapasta/

Still hand-copying Excel data into R? Whether you copy data by rows or by columns, datapasta automatically and quickly converts the copied data into tibbles, data.frames, or vectors.

For more details, see https://cran.r-project.org/web/packages/datapasta/vignettes/how-to-datapasta.html

styler

https://github.com/r-lib/styler

Think your code could look prettier? styler formats code according to the tidyverse style guide.

The goal of styler is to provide non-invasive pretty-printing of R source code while adhering to the tidyverse formatting rules. styler can be customized to format code according to other style guides too.

Highly recommended.

links

https://github.com/MilesMcBain/datapasta/
https://github.com/r-lib/styler
https://github.com/daattali/addinslist
https://www.zhihu.com/question/398418315

Finally, a few R packages that can export ggplot figures to PowerPoint

1. The graph2ppt function in export

https://github.com/tomwenseleers/export

Although export has been removed from CRAN, it can still be installed from its GitHub repository: devtools::install_github("tomwenseleers/export")

2. The topptx function in eoffice

https://github.com/guokai8/eoffice

3. officer

https://github.com/davidgohel/officer

For usage, see: https://www.brodrigues.co/blog/2018-10-05-ggplot2_purrr_officer/

Map NM ID to Gene Symbol

Happy New Year! This is the first post of 2021.

I previously wrote about mapping between ENSEMBL IDs and NCBI IDs: http://www.zxzyl.com/archives/736

In day-to-day analysis we often run into other ID-mapping tasks as well, not gene ID to gene ID, but transcript ID to gene ID.

If you are using the refGene annotation, the simplest approach is the following command:

mysql --user=genome -N --host=genome-mysql.cse.ucsc.edu -A -D hg38 -e "select name,name2 from refGene"

I also often get the mapping by parsing a GTF file, since a GTF contains both transcript IDs and gene symbols/IDs; any GTF file will do. In the spirit of not reinventing the wheel, I use an existing R package:

 
library(plyranges)
gr <- read_gff("/path/to/gtf/or/gff") %>% select(transcript_id, gene_id, gene_name)
gr <- unique(data.frame(gr))
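If you prefer to avoid an R dependency, the same transcript-to-gene mapping can be sketched in a few lines of Python. This assumes the common GTF attribute style in column 9 (`key "value";` pairs); the function and variable names are my own.

```python
import re

# Matches one key "value" pair in a GTF attribute column.
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')

def transcript_to_gene(lines):
    """Build a transcript_id -> gene_name map from GTF lines.

    `lines` is any iterable of GTF lines (e.g. an open file handle).
    Falls back to gene_id when gene_name is absent.
    """
    mapping = {}
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a valid GTF record
        attrs = dict(ATTR_RE.findall(fields[8]))
        tx = attrs.get("transcript_id")
        if tx:
            mapping[tx] = attrs.get("gene_name", attrs.get("gene_id"))
    return mapping

demo = ['chr5\trefGene\ttranscript\t1\t100\t.\t+\t.\t'
        'gene_id "CDK7"; transcript_id "NM_001324069"; gene_name "CDK7";']
print(transcript_to_gene(demo))  # {'NM_001324069': 'CDK7'}
```

Because it takes any iterable of lines, you can pass `open("file.gtf")` directly and then deduplicate or write the mapping out as needed.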

#####################################################################
#All rights reserved. Please notify the author before reposting; copyright remains with the author, and any infringement, once discovered, will be pursued under the law.
#Author: Jason
#####################################################################

Parse gtf

I often need to retrieve gene information from GTF files, and I couldn't find a tool flexible enough for my needs, so I modified the code from https://github.com/Jverma/GFF-Parser (thanks, Jverma). This version is easier to use.

/wp/f4w/2020/2020-10-02-gtf-parser.gif

Usage

Basically, there are three parameters:

id: either a transcript ID or a gene ID.

attType: an attribute defined in the GTF file, e.g. feature (column 3), or gene_name, exon_number, transcript_id from column 9.

attValue: the attribute value to search for.

>>> import sys
>>> from gtfParser import gtfParser

>>> gtf = gtfParser("example.gtf")

>>> # Get all exons in CDK7
>>> gtf.getRecordsByID("CDK7", "feature", "exon")

>>> # Get all features of transcript_id defined as "NM_001324069" in gene "CDK7"
>>> gtf.getRecordsByID("CDK7", "transcript_id", "NM_001324069")

>>> # Get start codon where feature was defined as "start_codon" in transcript "NM_001324069"
>>> gtf.getRecordsByID("NM_001324069", "feature", "start_codon")

>>> # Get an exon whose id is "NM_001324078.1" in the "NM_001324078" transcript
>>> gtf.getRecordsByID("NM_001324078", "exon_id", "NM_001324078.1")

# Example gtf
Here is a simple example of a GTF file that you can use for testing, subset from refSeq.hg38.gtf.
