Continuous Long Read, subreads and scraps in PacBio raw data

PacBio's tools are updated really quickly: https://github.com/PacificBiosciences/pbbioconda

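Processed Iso-Seq data used to be called high quality reads; now they are called HiFi reads. The PacBio toolkit also tends to bundle every tool used in an Iso-Seq analysis, including the downstream alignment and collapse steps. After tofu was discontinued, cDNA_Cupcake took over as well. Below are explanations of a few terms (CLR, CCS, subreads, scraps, and so on) that you will run into when handling the raw data.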

The contiguous sequence generated by the polymerase during sequencing is referred to as a “polymerase read” or a Continuous Long Read (CLR).

A CLR may include adapter sequence and multiple copies of the insert, because the polymerase traverses the circular template many times. CLRs are processed to remove the adapter sequences and retain only the insert sequences, which are called “subreads”.

All remaining sequence from the CLR is called “scraps”.
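The split described above can be illustrated with a minimal Python sketch. It is a toy model, not PacBio's actual pipeline: the function name `split_polymerase_read`, the half-open interval convention, and the assumption that the high-quality region and adapter positions are already known are all introduced here for illustration only.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), hypothetical convention


def split_polymerase_read(read: str, hq: Interval, adapters: List[Interval]):
    """Split a polymerase read (CLR) into subreads and scraps.

    Toy model: `hq` is the high-quality region and `adapters` are adapter
    hits inside it. Both are assumed to be given; finding them is the job
    of the real basecaller and is not shown here.
    """
    hq_start, hq_end = hq
    subreads = []  # (start, end, sequence)
    scraps = []    # (start, end, sequence, label)

    # Sequence before the high-quality region goes to scraps.
    if hq_start > 0:
        scraps.append((0, hq_start, read[:hq_start], "low_quality"))

    cursor = hq_start
    for a_start, a_end in sorted(adapters):
        # This sketch ignores adapters that overlap the cursor or leave the HQ region.
        if a_start < cursor or a_end > hq_end:
            continue
        # Insert sequence between the cursor and the adapter is a subread.
        if a_start > cursor:
            subreads.append((cursor, a_start, read[cursor:a_start]))
        # The excised adapter itself goes to scraps.
        scraps.append((a_start, a_end, read[a_start:a_end], "adapter"))
        cursor = a_end

    # Trailing insert sequence after the last adapter is a (partial) subread.
    if cursor < hq_end:
        subreads.append((cursor, hq_end, read[cursor:hq_end]))

    # Sequence after the high-quality region goes to scraps.
    if hq_end < len(read):
        scraps.append((hq_end, len(read), read[hq_end:], "low_quality"))

    return subreads, scraps
```

In this model every base of the polymerase read ends up in exactly one of the two lists, which mirrors the idea that subreads plus scraps together account for the whole CLR.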

Multiple subreads generated from a single SMRTbell template can then be collapsed into a single, high-quality sequence, called the “read of insert” (ROI) or Circular Consensus Sequence (CCS).
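To see why several passes over the same insert can yield a more accurate sequence, here is a deliberately simplified Python sketch that takes a per-position majority vote. This is not the CCS algorithm: real consensus calling has to align the subreads, handle strand orientation and insertions/deletions, and use a statistical model. The sketch assumes the passes are already oriented, aligned, and of equal length, and the function name `majority_consensus` is hypothetical.

```python
from collections import Counter
from typing import List


def majority_consensus(passes: List[str]) -> str:
    """Per-column majority vote over equal-length, pre-aligned passes.

    Toy stand-in for consensus calling: assumes no indels and that all
    passes are in the same orientation.
    """
    if not passes:
        raise ValueError("need at least one pass")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("this sketch requires equal-length passes")

    consensus = []
    for column in zip(*passes):
        # Pick the most frequent base in this column.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Four noisy passes over the same hypothetical insert; errors fall at
# different positions, so the vote can recover the underlying sequence.
example_passes = ["ACGTAC", "ACGAAC", "ACGTAC", "TCGTAC"]
print(majority_consensus(example_passes))
```

The point of the sketch is only that random errors in individual passes tend not to coincide, so more passes give the vote more evidence at each position.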

On the difference between the scraps and subreads BAM files: using only the subreads is sufficient.
The scraps file includes all the sequence not in the HQ (high quality) region, and the adapter sequences. The old bax format saved everything with indexes into the usable data. The subreads.bam contains all the usable data.

.subreads.bam: A file containing usable reads without adapters. The Sequel System outputs one subreads.bam file per collection/cell, which contains unaligned base calls from high-quality regions.
.scraps.bam: A file containing sequence data outside of the high-quality region, rejected subreads, excised adapter and possible barcode sequences, as well as spike-in control sequences. (The basecaller marks regions of single-molecule sequence activity as high-quality.)
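When working with a subreads.bam file, it is often useful to group subreads by the ZMW (hole) they came from. The Python sketch below does this from read names alone. It assumes subread names follow the pattern movieName/holeNumber/qStart_qEnd; that naming convention, the helper names `parse_subread_name` and `group_by_zmw`, and the example names are assumptions for illustration. Reading the names out of the BAM file itself (for example with a BAM library) is left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class SubreadName(NamedTuple):
    movie: str    # movie (run) name
    zmw: int      # hole number of the ZMW
    q_start: int  # start offset within the polymerase read
    q_end: int    # end offset within the polymerase read


def parse_subread_name(name: str) -> SubreadName:
    """Parse a name assumed to look like movieName/holeNumber/qStart_qEnd."""
    movie, hole, span = name.split("/")
    q_start, q_end = span.split("_")
    return SubreadName(movie, int(hole), int(q_start), int(q_end))


def group_by_zmw(names: Iterable[str]) -> Dict[Tuple[str, int], List[SubreadName]]:
    """Group subreads by (movie, ZMW), ordered by position in the polymerase read."""
    groups: Dict[Tuple[str, int], List[SubreadName]] = defaultdict(list)
    for name in names:
        rec = parse_subread_name(name)
        groups[(rec.movie, rec.zmw)].append(rec)
    for recs in groups.values():
        recs.sort(key=lambda r: r.q_start)
    return dict(groups)


# Hypothetical read names, made up for this example.
example_names = [
    "movieA/101/1200_2300",
    "movieA/101/0_1150",
    "movieA/202/50_980",
]
for key, recs in group_by_zmw(example_names).items():
    print(key, [(r.q_start, r.q_end) for r in recs])
```

Each group then corresponds to the set of passes over one molecule, which is the unit that consensus calling works on.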

Ref:
https://www.zxzyl.com/archives/1044

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03751-8

http://seqanswers.com/forums/showpost.php?p=202171&postcount=9
