Solved: extracting BioSample information and converting a BioSample ID to an SRA Run ID.

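I have an NCBI BioSample ID, for example SAMN02324197, and I want to find its SRR Run ID without knowing which BioProject it belongs to. The simplest way is to search for the BioSample ID directly in SRA; the result page shows both the Project and the Run.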

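But if I have hundreds of BioSamples from different projects, looking them up one by one is not practical. I first saw in this Biostars post, https://www.biostars.org/p/97782/, that EDirect can look up the samples of a GSE, so I wondered whether it could query SRA as well.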

I tried the following command, and it does find the record:

> esearch -db sra -query "SAMN02324197"

<ENTREZ_DIRECT>
  <Db>sra</Db>
  <WebEnv>MCID_6243c437ec6f7a20fc0f452a</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>1</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>
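
The Count field in this output says how many SRA records matched the query. As a rough sketch, it can be pulled out with sed so that a script can skip IDs that have no hits. The function name hit_count is my own invention, not part of EDirect, and the sketch assumes that Count sits on its own line as it does in the output above.

```shell
# hit_count: read esearch output (ENTREZ_DIRECT XML) on stdin and print
# the number inside <Count>...</Count>.
# Hypothetical helper, not an EDirect command; assumes one <Count> per line.
# Usage: esearch -db sra -query "SAMN02324197" | hit_count
hit_count() {
    sed -n 's:.*<Count>\([0-9][0-9]*\)</Count>.*:\1:p'
}
```

A result of 0 would mean there is nothing for efetch to retrieve for that ID.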

To retrieve the record itself, just add efetch:

> esearch -db sra -query "SAMN02324197" | \
         efetch -format docsum

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE DocumentSummarySet>
<DocumentSummarySet status="OK">
  <DbBuild>Build220329-1838m.1</DbBuild>
  <DocumentSummary>
    <Id>477293</Id>
    <ExpXml>
      <Summary>
        <Title>Genome sequence of Apis mellifera worker from the A lineage [Africa]-S300</Title>
        <Platform instrument_model="Illumina HiSeq 2000">ILLUMINA</Platform>
        <Statistics total_runs="1" total_spots="195798261" total_bases="9789913050" total_size="7003530482" load_done="true" cluster_name="public"/>
      </Summary>
      <Submitter acc="SRA096086" center_name="York University" contact_name="Amro Zayed" lab_name="Zayed Lab"/>
      <Experiment acc="SRX339524" ver="2" status="public" name="Genome sequence of Apis mellifera worker from the A lineage [Africa]-S300"/>
      <Study acc="SRP029219" name="Apis mellifera population genomics project"/>
      <Organism taxid="7460" ScientificName="Apis mellifera"/>
      <Sample acc="SRS473244" name=""/>
      <Instrument ILLUMINA="Illumina HiSeq 2000"/>
      <Library_descriptor>
        <LIBRARY_STRATEGY>WGS</LIBRARY_STRATEGY>
        <LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE>
        <LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION>
        <LIBRARY_LAYOUT>
          <SINGLE/>
        </LIBRARY_LAYOUT>
      </Library_descriptor>
      <Bioproject>PRJNA216922</Bioproject>
      <Biosample>SAMN02324197</Biosample>
    </ExpXml>
    <Runs>
      <Run acc="SRR957095" total_spots="195798261" total_bases="9789913050" load_done="true" is_public="true" cluster_name="public" static_data_available="true"/>
    </Runs>
    <ExtLinks/>
    <CreateDate>2014/02/05</CreateDate>
    <UpdateDate>2013/08/26</UpdateDate>
  </DocumentSummary>
</DocumentSummarySet>

Everything I want is in this output, so the next step is to extract the relevant fields:

> esearch -db sra -query "SAMN02324197" | \
         efetch -format docsum | \
         xtract -pattern DocumentSummary -element Bioproject,Biosample,Title

PRJNA216922     SAMN02324197    Genome sequence of Apis mellifera worker from the A lineage [Africa]-S300

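However, I could not get the SRR ID out this way no matter what I tried, because the accession is stored as the acc attribute of the Run element rather than as element text. The NCBI manual shows that it can be extracted with -block and the @ attribute syntax, which gives the Run ID that belongs to the BioSample: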

> esearch -db sra -query "SAMN02324197" | \
         efetch -format docsum | \
         xtract -pattern DocumentSummary -element Bioproject,Biosample,Title -block Runs -element Run@acc

PRJNA216922     SAMN02324197    Genome sequence of Apis mellifera worker from the A lineage [Africa]-S300       SRR957095
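
If a BioSample has more than one run, I would expect the extra Run accessions to show up as additional tab-separated columns on the same line. A minimal sketch for turning such a line into one row per run with awk follows. The function name one_row_per_run is hypothetical, and the sketch assumes the first three columns are always Bioproject, Biosample and Title, as in the command above.

```shell
# one_row_per_run: read tab-separated lines on stdin where columns 1-3 are
# BioProject, BioSample and Title and every further column is a Run accession;
# print one output line per Run accession.
# Hypothetical helper, not part of EDirect. Lines without a run are dropped.
one_row_per_run() {
    awk -F '\t' -v OFS='\t' 'NF >= 4 { for (i = 4; i <= NF; i++) print $1, $2, $3, $i }'
}
```

It would be appended to the end of the pipeline, after the xtract step.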

Batch queries then become simple:

> head -4 biosample.ids.list | xargs -I {} sh -c "esearch -db sra -query {} | efetch -format docsum | xtract -pattern DocumentSummary -element Bioproject,Biosample,Title -block Runs -element Run@acc"

PRJNA418874     SAMN08038389    Resequencing of Apis cerana: adult female body  SRR6301311
PRJNA418874     SAMN08038319    Resequencing of Apis cerana: adult female body  SRR6301296
PRJNA418874     SAMN08038413    Resequencing of Apis cerana: adult female body  SRR6301438
PRJNA418874     SAMN08038427    Resequencing of Apis cerana: adult female body  SRR6301396
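
For a longer ID list, the same pipeline can also be wrapped in a small shell function that reads one ID per line. This is only a sketch under a few assumptions: the name lookup_runs is my own, EDirect is on the PATH, and esearch is given /dev/null as its stdin because EDirect commands may otherwise read the rest of the ID list from the loop's input.

```shell
# lookup_runs: read BioSample IDs (one per line) on stdin and print, for each,
# BioProject, BioSample, Title and Run accession(s), tab-separated.
# Hypothetical wrapper around the pipeline above; needs EDirect on PATH.
# Usage: lookup_runs < biosample.ids.list > biosample_runs.tsv
lookup_runs() {
    while IFS= read -r id; do
        # Skip blank lines so an empty query is never sent.
        [ -n "$id" ] || continue
        # esearch reads from /dev/null so it cannot swallow the remaining IDs.
        result=$(esearch -db sra -query "$id" < /dev/null |
            efetch -format docsum |
            xtract -pattern DocumentSummary -element Bioproject,Biosample,Title \
                -block Runs -element Run@acc)
        if [ -n "$result" ]; then
            printf '%s\n' "$result"
        else
            # Report IDs without output on stderr instead of dropping them silently.
            printf 'no SRA record found for %s\n' "$id" >&2
        fi
    done
}
```

Compared with the xargs one-liner, the loop makes it easier to see which IDs returned nothing, since those are reported on stderr while the table goes to stdout.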

References

https://www.ncbi.nlm.nih.gov/books/NBK179288/

https://www.biostars.org/p/97782/

####################################################################

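#All rights reserved. Please notify the author before reposting. Copyright belongs to the author; any infringement, once discovered, will be pursued legally.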

#Author: Jason

#####################################################################