解决:提取Biosample的信息,将Biosample Id转换为SRA Run ID。
我有一个NCBI的biosmaple ID, 比如SAMN02324197,我不知道Bio project的情况下,我想知道他的SRR Run ID。最简单的办法是直接在SRA中搜这个biosample的ID,网页中Project,Run的信息都有了。
不过我如果有上百个biosample,来自不同的project,总不能一个一个的查吧。我先是在biostar上的这个https://www.biostars.org/p/97782/看到了EDirect可以查GSE的样本,我就在想能不能查sra的。
我试了一下这个命令,能检索到
1
2
3
4
5
6
7
8
9
10
|
> esearch -db sra -query "SAMN02324197"
<ENTREZ_DIRECT>
<Db>sra</Db>
<WebEnv>MCID_6243c437ec6f7a20fc0f452a</WebEnv>
<QueryKey>1</QueryKey>
<Count>1</Count>
<Step>1</Step>
</ENTREZ_DIRECT>
|
但是我想提取相关的信息呢,添加了efetch即可
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
|
> esearch -db sra -query "SAMN02324197" | \
efetch -format docsum
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE DocumentSummarySet>
<DocumentSummarySet status="OK">
<DbBuild>Build220329-1838m.1</DbBuild>
<DocumentSummary>
<Id>477293</Id>
<ExpXml>
<Summary>
<Title>Genome sequence of Apis mellifera worker from the A lineage [Africa]-S300</Title>
<Platform instrument_model="Illumina HiSeq 2000">ILLUMINA</Platform>
<Statistics total_runs="1" total_spots="195798261" total_bases="9789913050" total_size="7003530482" load_done="true" cluster_name="public"/>
</Summary>
<Submitter acc="SRA096086" center_name="York University" contact_name="Amro Zayed" lab_name="Zayed Lab"/>
<Experiment acc="SRX339524" ver="2" status="public" name="Genome sequence of Apis mellifera worker from the A lineage [Africa]-S300"/>
<Study acc="SRP029219" name="Apis mellifera population genomics project"/>
<Organism taxid="7460" ScientificName="Apis mellifera"/>
<Sample acc="SRS473244" name=""/>
<Instrument ILLUMINA="Illumina HiSeq 2000"/>
<Library_descriptor>
<LIBRARY_STRATEGY>WGS</LIBRARY_STRATEGY>
<LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE>
<LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION>
<LIBRARY_LAYOUT>
<SINGLE/>
</LIBRARY_LAYOUT>
</Library_descriptor>
<Bioproject>PRJNA216922</Bioproject>
<Biosample>SAMN02324197</Biosample>
</ExpXml>
<Runs>
<Run acc="SRR957095" total_spots="195798261" total_bases="9789913050" load_done="true" is_public="true" cluster_name="public" static_data_available="true"/>
</Runs>
<ExtLinks/>
<CreateDate>2014/02/05</CreateDate>
<UpdateDate>2013/08/26</UpdateDate>
</DocumentSummary>
</DocumentSummarySet>
|
这个时候能看到想要的信息都在这里面了,然后试着提取相关的信息
1
2
3
4
5
|
> esearch -db sra -query "SAMN02324197" | \
efetch -format docsum | \
xtract -pattern DocumentSummary -element Bioproject,Biosample,Title
PRJNA216922 SAMN02324197 Genome sequence of Apis mellifera worker from the A lineage [Africa]-S300
|
但这个时候SRR的ID死活提取不来,翻了NCBI的手册看到可以通过block和@提取,这样相关的biosample对应的Run ID信息就有了。
1
2
3
4
5
|
> esearch -db sra -query "SAMN02324197" | \
efetch -format docsum | \
xtract -pattern DocumentSummary -element Bioproject,Biosample,Title -block Runs -element Run@acc
PRJNA216922 SAMN02324197 Genome sequence of Apis mellifera worker from the A lineage [Africa]-S300 SRR957095
|
批量查询就变得很简单了
1
2
3
4
5
6
|
> cat biosample.ids.list| head -4 | xargs -L 1 -i sh -c "esearch -db sra -query {} | efetch -format docsum | xtract -pattern DocumentSummary -element Bioproject,Biosample,Title -block Runs -element Run@acc"
PRJNA418874 SAMN08038389 Resequencing of Apis cerana: adult female body SRR6301311
PRJNA418874 SAMN08038319 Resequencing of Apis cerana: adult female body SRR6301296
PRJNA418874 SAMN08038413 Resequencing of Apis cerana: adult female body SRR6301438
PRJNA418874 SAMN08038427 Resequencing of Apis cerana: adult female body SRR6301396
|
参考
https://www.ncbi.nlm.nih.gov/books/NBK179288/
https://www.biostars.org/p/97782/
####################################################################
#版权所有 转载请告知 版权归作者所有 如有侵权 一经发现 必将追究其法律责任
#Author: Jason
#####################################################################