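Solved: extract BioSample information and convert a BioSample ID to an SRA Run ID.
I have an NCBI BioSample ID, for example SAMN02324197, and without knowing the BioProject I want to find its SRR run ID. The simplest way is to search for the BioSample ID directly in SRA; the result page shows both the project and the run.
But with hundreds of BioSamples from different projects, looking them up one by one is not practical. I first saw in this Biostars post, https://www.biostars.org/p/97782/, that EDirect can query GSE samples, so I wondered whether it could query SRA too.
I tried the following command, and it does find the record: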
> esearch -db sra -query "SAMN02324197"
<ENTREZ_DIRECT>
<Db>sra</Db>
<WebEnv>MCID_6243c437ec6f7a20fc0f452a</WebEnv>
<QueryKey>1</QueryKey>
<Count>1</Count>
<Step>1</Step>
</ENTREZ_DIRECT>
To pull out the associated metadata, just pipe the result into efetch:
> esearch -db sra -query "SAMN02324197" | \
efetch -format docsum
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE DocumentSummarySet>
<DocumentSummarySet status="OK">
<DbBuild>Build220329-1838m.1</DbBuild>
<DocumentSummary>
<Id>477293</Id>
<ExpXml>
<Summary>
<Title>Genome sequence of Apis mellifera worker from the A lineage [Africa]-S300</Title>
<Platform instrument_model="Illumina HiSeq 2000">ILLUMINA</Platform>
<Statistics total_runs="1" total_spots="195798261" total_bases="9789913050" total_size="7003530482" load_done="true" cluster_name="public"/>
</Summary>
<Submitter acc="SRA096086" center_name="York University" contact_name="Amro Zayed" lab_name="Zayed Lab"/>
<Experiment acc="SRX339524" ver="2" status="public" name="Genome sequence of Apis mellifera worker from the A lineage [Africa]-S300"/>
<Study acc="SRP029219" name="Apis mellifera population genomics project"/>
<Organism taxid="7460" ScientificName="Apis mellifera"/>
<Sample acc="SRS473244" name=""/>
<Instrument ILLUMINA="Illumina HiSeq 2000"/>
<Library_descriptor>
<LIBRARY_STRATEGY>WGS</LIBRARY_STRATEGY>
<LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE>
<LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION>
<LIBRARY_LAYOUT>
<SINGLE/>
</LIBRARY_LAYOUT>
</Library_descriptor>
<Bioproject>PRJNA216922</Bioproject>
<Biosample>SAMN02324197</Biosample>
</ExpXml>
<Runs>
<Run acc="SRR957095" total_spots="195798261" total_bases="9789913050" load_done="true" is_public="true" cluster_name="public" static_data_available="true"/>
</Runs>
<ExtLinks/>
<CreateDate>2014/02/05</CreateDate>
<UpdateDate>2013/08/26</UpdateDate>
</DocumentSummary>
</DocumentSummarySet>
Everything I need is in this XML, so the next step is to extract the relevant fields with xtract:
> esearch -db sra -query "SAMN02324197" | \
efetch -format docsum | \
xtract -pattern DocumentSummary -element Bioproject,Biosample,Title
PRJNA216922 SAMN02324197 Genome sequence of Apis mellifera worker from the A lineage [Africa]-S300
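At this point, though, I could not get the SRR ID out no matter what I tried: the accession is stored in the acc attribute of the Run element rather than as element text. The NCBI manual shows that it can be extracted with -block and the @ attribute syntax, which gives the run ID for the BioSample: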
> esearch -db sra -query "SAMN02324197" | \
efetch -format docsum | \
xtract -pattern DocumentSummary -element Bioproject,Biosample,Title -block Runs -element Run@acc
PRJNA216922 SAMN02324197 Genome sequence of Apis mellifera worker from the A lineage [Africa]-S300 SRR957095
Batch queries then become simple:
> head -4 biosample.ids.list | xargs -I {} sh -c "esearch -db sra -query {} | efetch -format docsum | xtract -pattern DocumentSummary -element Bioproject,Biosample,Title -block Runs -element Run@acc"
PRJNA418874 SAMN08038389 Resequencing of Apis cerana: adult female body SRR6301311
PRJNA418874 SAMN08038319 Resequencing of Apis cerana: adult female body SRR6301296
PRJNA418874 SAMN08038413 Resequencing of Apis cerana: adult female body SRR6301438
PRJNA418874 SAMN08038427 Resequencing of Apis cerana: adult female body SRR6301396
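With the xargs one-liner, an ID that matches nothing in SRA simply produces no output line, which is easy to miss in a list of hundreds. Below is a minimal sketch of the same pipeline wrapped in a loop that reports such IDs on stderr. The function names lookup_runs and batch_lookup are hypothetical, EDirect is assumed to be on PATH, and the </dev/null redirect is there on the assumption that esearch would otherwise read the loop's standard input and swallow the rest of the ID list.

```shell
# Look up one BioSample ID (hypothetical helper).
# Prints: BioProject, BioSample, Title, Run accession(s).
lookup_runs() {
  # </dev/null keeps esearch from consuming the caller's stdin (the ID list).
  esearch -db sra -query "$1" </dev/null |
    efetch -format docsum |
    xtract -pattern DocumentSummary -element Bioproject,Biosample,Title \
      -block Runs -sep "," -element Run@acc
}

# Read IDs from stdin, one per line (hypothetical helper).
# Results go to stdout; IDs that return nothing are reported on stderr.
batch_lookup() {
  while IFS= read -r id; do
    [ -n "$id" ] || continue            # skip blank lines
    result=$(lookup_runs "$id")
    if [ -n "$result" ]; then
      printf '%s\n' "$result"
    else
      printf 'no SRA record found for %s\n' "$id" >&2
    fi
  done
}
```

It could be run as: batch_lookup < biosample.ids.list > runs.tsv 2> missing.log (both output file names are placeholders).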
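Later I ran into BioSamples with more than one run. Left as is, each run takes up its own column; -sep sets the separator so that all runs are joined into a single column.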
efetch -db sra -format docsum -id SRP043122 | xtract -pattern DocumentSummary -element Bioproject,Study@acc,Study@name,Sample@acc,Sample@name,Title,Platform@instrument_model,LIBRARY_CONSTRUCTION_PROTOCOL -block Runs -sep "," -element Run@acc
PRJNA252516 SRP043122 Exosomal miR-1290 and miR-375 as prognostic markers in castrate resistant prostate cancer SRS634616 GSM1410446: PC8; Homo sapiens; ncRNA-Seq Illumina HiSeq 2000 Exosome isolation and RNA extraction was reported previously. The plasma exosomes were isolated using ExoQuick Solution (SBI, Mountain View, CA). RNA was extracted with the miRNeasy Micro Kit (QIAGEN, Valencia, CA). RNA quantity was determined by a fluorometer (Qubit 2.0, Life Technologies, Carlsbad, CA). Agilent Small RNA Chip (Agilent Technologies, Santa Clara, CA) was used to determine RNA quality. Small RNA libraries were constructed with 2ng exosomal RNA from each patient as starting material following the instruction of the NEBNext Multiplex Small RNA Library Prep Set for Illumina (NEB). Absolute quantities of each indexed library were determined by real-time qPCR using KAPA Library Quantification Kit. Equally 2nM of each index library was pooled and submitted for sequencing using the Illumina HiSeq2000 platform. SRR1379721,SRR1379722,SRR1379723,SRR1379724
PRJNA252516 SRP043122 Exosomal miR-1290 and miR-375 as prognostic markers in castrate resistant prostate cancer SRS634617 GSM1410447: PC9; Homo sapiens; ncRNA-Seq Illumina HiSeq 2000 Exosome isolation and RNA extraction was reported previously. The plasma exosomes were isolated using ExoQuick Solution (SBI, Mountain View, CA). RNA was extracted with the miRNeasy Micro Kit (QIAGEN, Valencia, CA). RNA quantity was determined by a fluorometer (Qubit 2.0, Life Technologies, Carlsbad, CA). Agilent Small RNA Chip (Agilent Technologies, Santa Clara, CA) was used to determine RNA quality. Small RNA libraries were constructed with 2ng exosomal RNA from each patient as starting material following the instruction of the NEBNext Multiplex Small RNA Library Prep Set for Illumina (NEB). Absolute quantities of each indexed library were determined by real-time qPCR using KAPA Library Quantification Kit. Equally 2nM of each index library was pooled and submitted for sequencing using the Illumina HiSeq2000 platform. SRR1379725,SRR1379726,SRR1379727
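If one row per run is easier to work with than a comma-joined column, the output can be expanded afterwards. This is a minimal sketch, not part of EDirect: expand_runs is a hypothetical helper that assumes tab-separated input (xtract's default) with the comma-joined runs in the last column. The sample row is shortened from the output above.

```shell
# Hypothetical helper: expand the last tab-separated column
# (comma-joined run accessions) into one output row per run.
expand_runs() {
  awk -F'\t' 'BEGIN { OFS = "\t" }
  {
    n = split($NF, runs, ",")
    if (n == 0) { print; next }   # no runs listed: keep the row unchanged
    for (i = 1; i <= n; i++) {
      $NF = runs[i]               # replace the joined list with a single run
      print
    }
  }'
}

# Example with a shortened row from the SRP043122 query:
printf 'PRJNA252516\tSRS634617\tSRR1379725,SRR1379726,SRR1379727\n' | expand_runs
```

Each of the three runs ends up on its own line, with the other columns repeated.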
References
https://www.ncbi.nlm.nih.gov/books/NBK179288/
https://www.biostars.org/p/97782/
####################################################################
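# All rights reserved. Please notify the author before reposting. Copyright belongs to the author; any infringement, once discovered, will be pursued legally.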
#Author: Jason
#####################################################################