SAM/BAM文件中的MD标签

我用bowtie比对了序列,想查看reads的错配情况。SAM flag中的M包含了比对上的碱基和错位的碱基,不能区分错配。

参考bowtie的文档,可以看到XM的标签可以指示mismatch的个数,MD标签可以查看具体的错配情况。

XN:i:<N> The number of ambiguous bases in the reference covering this alignment. Only present if SAM record is for an aligned read.
XM:i:<N> The number of mismatches in the alignment. Only present if SAM record is for an aligned read.
MD:Z:<S> A string representation of the mismatched reference bases in the alignment. See SAM Tags format specification for details. Only present if SAM record is for an aligned read.

SAM手册对与MD的介绍

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
String encoding mismatched and deleted reference bases, used in conjunction with the CIGAR and SEQ fields to reconstruct the bases of the reference sequence interval to which the alignment has been mapped. This can enable variant calling without requiring access to the entire original reference.
编码错配或del的字符串,与CIGAR和SEQ一起使用,重建read的比对情况。可以在不需要整个参考序列的情况下,用于突变检测。

The MD string consists of the following items, concatenated without additional delimiter characters:
MD字符串由以下不含分隔符的条目组成。

• [0-9]+, indicating a run of reference bases that are identical to the corresponding SEQ bases;表示与相应SEQ碱基相同的一系列参考碱基;这里[0-9]+是正则表示。
• [A-Z], identifying a single reference base that differs from the SEQ base aligned at that position;在这个位置SEQ碱基与参考碱基不一致
• \^[A-Z]+, identifying a run of reference bases that have been deleted in the alignment.以^开头,表明有个del,注意del前有^。

总结:数字表示匹配,碱基表示错配,^碱基表示del。

As shown in the complete regular expression above, numbers alternate with the other items. Thus if two mismatches or deletions are adjacent without a run of identical bases between them, a ‘0’ (indicating a 0-length run) must be used to separate them in the MD string. 数字和字符交替出现,如果两个连续错配,中间需要用0来隔开。

Clipping, padding, reference skips, and insertions (‘H’, ‘S’, ‘P’, ‘N’, and ‘I’ CIGAR operations) are not represented in the MD string. When reconstructing the reference sequence, inserted and soft-clipped SEQ bases are omitted as determined by tracking ‘I’ and ‘S’ operations in the CIGAR string. (If the CIGAR string contains ‘N’ operations, then the corresponding skipped parts of the reference sequence cannot be reconstructed.)
Clipping, padding, reference skips, and insertions (‘H’, ‘S’, ‘P’, ‘N’, and ‘I’ CIGAR operations)不体现在MD字符串中,所以要与CIGAR结合。

For example, a string ‘10A5^AC6’ means from the leftmost reference base in the alignment, there are 10 matches followed by an A on the reference which is different from the aligned read base; the next 5 reference bases are matches followed by a 2bp deletion from the reference; the deleted sequence is AC; the last 6 bases are matches.
10A5^AC6表示10个匹配(10),一个与A不匹配(A),2bp的del(^AC),6碱基的匹配。

JC-单细胞转录组分析揭示人类子宫内膜癌的起源和病理过程

Single-cell transcriptomic analysis highlights origin and pathological process of human endometrioid endometrial carcinoma

https://www.nature.com/articles/s41467-022-33982-7

背景

子宫内膜癌(Endometrial cancer, EC)是妇科最常见的恶性肿瘤之一,子宫内膜样子宫内膜癌(endometrioid endomecancer, EEC)是EC的主要病理类型。

在雌激素依赖性EEC肿瘤发生过程中,子宫内膜在没有孕激素保护的情况下长期暴露于雌激素中,表现出不受控制的增殖,并且可以从正常子宫内膜发展到非典型子宫内膜增生(AEH, EEC癌前阶段),然后逐步发展到EEC。关于ECC的起源过往研究推测包括子宫内膜上皮和基质干成分在内的多种谱系可能是EEC的起源,但证据不足以支持明确起源。

肿瘤微环境由免疫细胞、成纤维细胞、周细胞等组成,在肿瘤的发生、预后和转移中起重要作用,尽管先前的研究已经提示肿瘤微环境在预后和治疗耐药的潜在作用,但从正常子宫内膜到EEC形成的过程仍不明确。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
graph TB
    A[子宫内膜及非典型子宫内膜增生AEH及子宫内膜样子宫内膜癌ECC] --> B[细胞分群]
    B --> C[上皮细胞比例在AEH中增加,在EEC中进一步扩大,CNV变化大]
    B --> D[间质成纤维细胞比例下降]
    C --> F[RNA velocity分析,细胞相似性分析,MET marer基因分析等表明ECC不来源CAF]
    D --> F
    F --> G[EEC的上皮聚类,发现AEH特有非纤毛上皮亚群,并在ECC中存在,推测来源于正常的非纤毛上皮]    
    G --> H[EEC上皮细胞独有亚群为致癌亚群,RNA velocity分析非纤毛上皮腺细胞有可能是致癌亚群中存在的来源]
    H --> I[致癌亚群特征基因,发现LCN2和SAA1/2可能是子宫内膜早期肿瘤发生的一个特征]
    I --> J[类器官实验证明在正常子宫内膜和EEC中成纤维细胞是不可缺少的]
    J --> K[类器官实验证明在正常子宫内膜和EEC中成纤维细胞是不可缺少的]
    K --> L[巨噬细胞和淋巴细胞亚群分析表明免疫环境的失调可导致子宫内膜肿瘤的发生]

非编码小RNA的fasta序列下载资源

snoRNA snRNA https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz piRNA piRDB https://www.pirnadb.org/download/archive piRBase http://bigdata.ibp.ac.cn/piRBase/download/v3.0/fasta/hsa.v3.0.fa.gz piRNA Bank http://pirnabank.ibab.ac.in/ tRNA GtRNAdb high confidence mature tRNA sequences http://gtrnadb.ucsc.edu/genomes/eukaryota/Hsapi38/hg38-mature-tRNAs.fa mitocondrial tRNA sequences from mitotRNAdb http://mttrna.bioinf.uni-leipzig.de/mtDataOutput/ miRNA https://mirbase.org/download/ yRNA 18S (NR_145820.1), 5S (NR_023363.1), 28S (NR_003287.4) and 5.8S (NR_145821.1); RNY1 (NR_004391.1), RNY3 (NR_004392.1), RNY4 (NR_004393.1) and RNY5 (NR_001571.2) rRNA https://www.ncbi.nlm.nih.gov/nucleotide?term=txid9606[Organism] 选择rRNA下载

recount3超大规模可用转录组数据集

随着测序数据的积累,如何复用这些数据是一个挑战。recount项目,目前是recount3,共收集了8,679人和10,088小鼠的数据集,超过750,000个样本,经过统一处理(uniformly processed),得到gene或exon的表达以及exon-exon junction的数据。好多年前,我就了解过recount项目,很奇怪很少有介绍的。

一,recount对所有属于进行了uniformly processed,避免了分析流程的bias;

二,recount提供了超大规模的预处理之后的数据,直接拿来用,避免研究人员从原始数据分析;

三,recount提供了简单易用的工具,方便研究人员下载和处理数据。

方法1:下载TCGA-OV为例,检索过滤然后下载

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
library(recount3)

# 同 https://jhubiostatistics.shinyapps.io/recount3-study-explorer/
# 可以看到project_home和project,包括TCGA,GTEX和SRA
human_projects <- available_projects()

proj_info <- subset(
  human_projects,
  project == "OV" & project_type == "data_sources"
)

rse_gene_tcga_ov <- create_rse(proj_info)

#counts data
assays(rse_gene_tcga_ov)$counts <- transform_counts(rse_gene_tcga_ov)
# ## Compute TPMs
assays(rse_gene_tcga_ov)$TPM <- recount::getTPM(rse_gene_tcga_ov, length_var = "score")
# ## Check TPM. Should all be equal to 1
colSums(assay(rse_gene_tcga_ov, "TPM")) / 1e6 


# View sample annotation
View(data.frame(colData(rse_gene_tcga_ov)))

# View gene annotation
View(data.frame(exp$tcga.ov.expr@rowRanges))

方法2:直接选中数据集,生成下载code

在这个网站选中想要下载的数据集,https://jhubiostatistics.shinyapps.io/recount3-study-explorer/,网站下方会显示下载的code。注释不一定是26,还可以是RefSeq v109,Gencode v29等。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
rse_gene_tcga_ov = recount3::create_rse_manual(
    project = "OV",
    project_home = "data_sources/tcga",
    organism = "human",
    annotation = "gencode_v26",
    type = "gene"
)

#counts data
assays(rse_gene_tcga_ov)$counts <- transform_counts(rse_gene_tcga_ov)
# ## Compute TPMs
assays(rse_gene_tcga_ov)$TPM <- recount::getTPM(rse_gene_tcga_ov, length_var = "score")
# ## Check TPM. Should all be equal to 1
colSums(assay(rse_gene_tcga_ov, "TPM")) / 1e6

Kingfisher下载SRA数据

我知道一个SRP的编号,里面有我想要下载的数据,我想根据SRP编号快速下载数据,查到了Kingfisher这个工具。

https://github.com/wwood/kingfisher-download

文档:https://wwood.github.io/kingfisher-download/

安装:pip install kingfisher

主要有三个模块,get、extract、annotate

get

1
2
3
4
5
6
kingfisher get -r ERR1739691 -m ena-ascp
# 可以指定列表--run-identifiers-list
# 指定project编号SRP,--bioprojects
# 指定下载方法,--download-methods,可以指定多种,程序会一个方法一个方法的试,包括ena-ascp,ena-ftp,prefetch,aws-http,aws-cp,gcp-cp
# 指定线程数目,--download-threads
# 指定ascp需要的key路径,--ascp-ssh-key