Solved: extracting BioSample information and converting a BioSample ID to an SRA Run ID.

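I have an NCBI BioSample ID, for example SAMN02324197, and I want to find its SRR run ID without knowing which BioProject it belongs to. The simplest way is to search for the BioSample ID directly in SRA; the result page shows both the project and the run.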

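But if I have hundreds of BioSamples from different projects, looking them up one by one is not practical. I first saw in this Biostars thread, https://www.biostars.org/p/97782/, that EDirect can look up the samples of a GSE, which made me wonder whether it could query SRA as well.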

I tried the following command, and it does find the record:

> esearch -db sra -query "SAMN02324197"

<ENTREZ_DIRECT>
  <Db>sra</Db>
  <WebEnv>MCID_6243c437ec6f7a20fc0f452a</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>1</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>
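
The Count field in this output says how many SRA records matched the ID, which is worth checking before fetching anything. A minimal sketch, assuming EDirect is installed and on the PATH; the function name sra_hit_count is made up for illustration and is not part of EDirect:

```shell
# Hypothetical helper: print how many SRA records match the query in $1.
sra_hit_count() {
    # </dev/null stops esearch from consuming stdin when this is called in a loop.
    esearch -db sra -query "$1" </dev/null |
        xtract -pattern ENTREZ_DIRECT -element Count
}

# Example call (requires EDirect and network access):
# [ "$(sra_hit_count SAMN02324197)" -ge 1 ] || echo "no SRA record found" >&2
```

A count of 0 would mean the BioSample has no SRA record, so the later efetch step would have nothing to return.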

To pull out the details of the record, just add efetch:

> esearch -db sra -query "SAMN02324197" | \
         efetch -format docsum

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE DocumentSummarySet>
<DocumentSummarySet status="OK">
  <DbBuild>Build220329-1838m.1</DbBuild>
  <DocumentSummary>
    <Id>477293</Id>
    <ExpXml>
      <Summary>
        <Title>Genome sequence of Apis mellifera worker from the A lineage [Africa]-S300</Title>
        <Platform instrument_model="Illumina HiSeq 2000">ILLUMINA</Platform>
        <Statistics total_runs="1" total_spots="195798261" total_bases="9789913050" total_size="7003530482" load_done="true" cluster_name="public"/>
      </Summary>
      <Submitter acc="SRA096086" center_name="York University" contact_name="Amro Zayed" lab_name="Zayed Lab"/>
      <Experiment acc="SRX339524" ver="2" status="public" name="Genome sequence of Apis mellifera worker from the A lineage [Africa]-S300"/>
      <Study acc="SRP029219" name="Apis mellifera population genomics project"/>
      <Organism taxid="7460" ScientificName="Apis mellifera"/>
      <Sample acc="SRS473244" name=""/>
      <Instrument ILLUMINA="Illumina HiSeq 2000"/>
      <Library_descriptor>
        <LIBRARY_STRATEGY>WGS</LIBRARY_STRATEGY>
        <LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE>
        <LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION>
        <LIBRARY_LAYOUT>
          <SINGLE/>
        </LIBRARY_LAYOUT>
      </Library_descriptor>
      <Bioproject>PRJNA216922</Bioproject>
      <Biosample>SAMN02324197</Biosample>
    </ExpXml>
    <Runs>
      <Run acc="SRR957095" total_spots="195798261" total_bases="9789913050" load_done="true" is_public="true" cluster_name="public" static_data_available="true"/>
    </Runs>
    <ExtLinks/>
    <CreateDate>2014/02/05</CreateDate>
    <UpdateDate>2013/08/26</UpdateDate>
  </DocumentSummary>
</DocumentSummarySet>

All the information I want is in this output, so the next step is to extract the relevant fields:

> esearch -db sra -query "SAMN02324197" | \
         efetch -format docsum | \
         xtract -pattern DocumentSummary -element Bioproject,Biosample,Title

PRJNA216922     SAMN02324197    Genome sequence of Apis mellifera worker from the A lineage [Africa]-S300

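However, the SRR ID simply would not come out this way, because the accession is stored as the acc attribute of the <Run> element rather than as element text. After going through the NCBI manual, I found that it can be extracted with -block and the @ attribute syntax, which gives the run ID that corresponds to the BioSample.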

> esearch -db sra -query "SAMN02324197" | \
         efetch -format docsum | \
         xtract -pattern DocumentSummary -element Bioproject,Biosample,Title -block Runs -element Run@acc

PRJNA216922     SAMN02324197    Genome sequence of Apis mellifera worker from the A lineage [Africa]-S300       SRR957095

Batch queries then become very simple:

> head -4 biosample.ids.list | xargs -I {} sh -c "esearch -db sra -query {} | efetch -format docsum | xtract -pattern DocumentSummary -element Bioproject,Biosample,Title -block Runs -element Run@acc"

PRJNA418874     SAMN08038389    Resequencing of Apis cerana: adult female body  SRR6301311
PRJNA418874     SAMN08038319    Resequencing of Apis cerana: adult female body  SRR6301296
PRJNA418874     SAMN08038413    Resequencing of Apis cerana: adult female body  SRR6301438
PRJNA418874     SAMN08038427    Resequencing of Apis cerana: adult female body  SRR6301396
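
For a longer ID list, the same pipeline can be wrapped in a plain read loop instead of xargs. The following is only a sketch, assuming EDirect is installed and the list holds one BioSample ID per line; the function name biosample_to_runs and the output file name are made up for illustration. One thing to watch in such loops is that EDirect commands read standard input, so esearch is given /dev/null to keep it from swallowing the rest of the ID list.

```shell
# Hypothetical helper: print one tab-separated line per BioSample ID
# found in the file given as $1 (one ID per line).
biosample_to_runs() {
    while IFS= read -r id; do
        # Skip blank lines in the ID list.
        [ -n "$id" ] || continue
        # </dev/null keeps esearch from reading the remaining IDs on stdin.
        esearch -db sra -query "$id" </dev/null |
            efetch -format docsum |
            xtract -pattern DocumentSummary \
                -element Bioproject,Biosample,Title \
                -block Runs -sep "," -element Run@acc
    done <"$1"
}

# Example call (requires EDirect and network access):
# biosample_to_runs biosample.ids.list > biosample_runs.tsv
```

Quoting "$id" also avoids pasting raw list contents into an sh -c command string, which is the main design difference from the xargs one-liner. The -sep "," option joins multiple runs of one experiment into a single column.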

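Later on I ran into BioSamples that have more than one run. Without further handling, each run takes up its own column; -sep can be used to set the separator so that all runs are joined into a single column.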

efetch -db sra -format docsum -id SRP043122 |  xtract -pattern DocumentSummary -element Bioproject,Study@acc,Study@name,Sample@acc,Sample@name,Title,Platform@instrument_model,LIBRARY_CONSTRUCTION_PROTOCOL -block Runs  -sep "," -element Run@acc

PRJNA252516     SRP043122       Exosomal miR-1290 and miR-375 as prognostic markers in castrate resistant prostate cancer     SRS634616       GSM1410446: PC8; Homo sapiens; ncRNA-Seq        Illumina HiSeq 2000   Exosome isolation and RNA extraction was reported previously. The plasma exosomes were isolated using ExoQuick Solution (SBI, Mountain View, CA). RNA was extracted with the miRNeasy Micro Kit (QIAGEN, Valencia, CA). RNA quantity was determined by a fluorometer (Qubit 2.0, Life Technologies, Carlsbad, CA). Agilent Small RNA Chip (Agilent Technologies, Santa Clara, CA) was used to determine RNA quality. Small RNA libraries were constructed with 2ng exosomal RNA from each patient as starting material following the instruction of the NEBNext Multiplex Small RNA Library Prep Set for Illumina (NEB). Absolute quantities of each indexed library were determined by real-time qPCR using KAPA Library Quantification Kit. Equally 2nM of each index library was pooled and submitted for sequencing using the Illumina HiSeq2000 platform.  SRR1379721,SRR1379722,SRR1379723,SRR1379724
PRJNA252516     SRP043122       Exosomal miR-1290 and miR-375 as prognostic markers in castrate resistant prostate cancer     SRS634617       GSM1410447: PC9; Homo sapiens; ncRNA-Seq        Illumina HiSeq 2000   Exosome isolation and RNA extraction was reported previously. The plasma exosomes were isolated using ExoQuick Solution (SBI, Mountain View, CA). RNA was extracted with the miRNeasy Micro Kit (QIAGEN, Valencia, CA). RNA quantity was determined by a fluorometer (Qubit 2.0, Life Technologies, Carlsbad, CA). Agilent Small RNA Chip (Agilent Technologies, Santa Clara, CA) was used to determine RNA quality. Small RNA libraries were constructed with 2ng exosomal RNA from each patient as starting material following the instruction of the NEBNext Multiplex Small RNA Library Prep Set for Illumina (NEB). Absolute quantities of each indexed library were determined by real-time qPCR using KAPA Library Quantification Kit. Equally 2nM of each index library was pooled and submitted for sequencing using the Illumina HiSeq2000 platform.  SRR1379725,SRR1379726,SRR1379727
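
The comma-joined run column is compact, but download tools usually want one run per line. As a rough sketch using only standard awk (the function name expand_runs is made up for illustration), the last tab-separated column can be expanded into one row per run, repeating the other columns:

```shell
# Hypothetical helper: read tab-separated lines on stdin whose last column
# is a comma-joined run list, and print one line per run.
expand_runs() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        # Split the last column on commas, then emit the row once per run.
        n = split($NF, runs, ",")
        for (i = 1; i <= n; i++) {
            $NF = runs[i]
            print
        }
    }'
}

# Example: printf 'PRJNA252516\tSRS634617\tSRR1379725,SRR1379726\n' | expand_runs
```

Rows whose last column is empty produce no output, so samples without any run drop out of the expanded table.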

References

https://www.ncbi.nlm.nih.gov/books/NBK179288/

https://www.biostars.org/p/97782/

####################################################################

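#All rights reserved. Please notify the author before reposting. Copyright belongs to the author; any infringement, once discovered, will be pursued under the law.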

#Author: Jason

#####################################################################