Besides fetching data through GEO2R, you can also download the files yourself. Every GSE series page lists the following three files.

| Download family | Format | Description |
|---|---|---|
| SOFT formatted family file(s) | SOFT | SOFT family files are text files that incorporate complete data and metadata for all Platform, Sample and Series records in the family |
| MINiML formatted family file(s) | MINiML | MINiML family files are XML files that incorporate complete data and metadata for all Platform, Sample and Series records in the family |
| Series Matrix File(s) | TXT | Series matrix files are text files that include a tab-delimited value-matrix table generated from the 'VALUE' column of each Sample, headed by Sample and Series metadata. These files are suitable for loading into spreadsheet applications such as Excel. CAUTION: data are extracted directly from the original records with no consideration as to whether the values are directly comparable. |

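You can of course get the data with GEO2R and do not have to download the original files every time, but it is still important to understand the file formats and what they contain. The three files are the SOFT formatted family file, the MINiML formatted family file, and the Series Matrix File. The examples below use https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE68849.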
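For a direct download you need the file locations. The sketch below builds them from an accession, following the directory pattern visible in the supplementary-file links of this series (`.../geo/series/GSE68nnn/GSE68849/...`). The `soft`, `miniml` and `matrix` subdirectories and the file names are assumptions about GEO's FTP layout, and the function name `geo_series_urls` is made up for this example. Check the links on the series page before relying on them; a series with several platforms may use a different matrix file name.

```python
def geo_series_urls(accession):
    """Build likely download URLs for a GEO series (assumed FTP layout)."""
    digits = accession[3:]
    if not (accession.startswith("GSE") and digits.isdigit()):
        raise ValueError(f"not a GSE accession: {accession!r}")
    # The last three digits are replaced by "nnn": GSE68849 -> GSE68nnn
    stub = "GSE" + digits[:-3] + "nnn"
    base = f"ftp://ftp.ncbi.nlm.nih.gov/geo/series/{stub}/{accession}"
    return {
        "soft": f"{base}/soft/{accession}_family.soft.gz",        # assumed name
        "miniml": f"{base}/miniml/{accession}_family.xml.tgz",    # assumed name
        "matrix": f"{base}/matrix/{accession}_series_matrix.txt.gz",  # assumed name
    }


urls = geo_series_urls("GSE68849")
for kind, url in urls.items():
    print(kind, url)
```

The function only builds strings; the download itself can be done with any FTP-capable tool.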

SOFT formatted family file

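1. Metadata

The header of this file records the series accession, the description of the experiment, the samples it contains, the platform used, the submitter, and so on, matching what the web page shows. A line starting with ^ declares an entity, a line starting with ! gives an attribute of that entity, and a line starting with # is a description line (here, descriptions of the data table columns).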

^DATABASE = GeoMiame
!Database_name = Gene Expression Omnibus (GEO)
!Database_institute = NCBI NLM NIH
!Database_web_link = http://www.ncbi.nlm.nih.gov/geo
!Database_email = geo@ncbi.nlm.nih.gov
^SERIES = GSE68849
!Series_title = Impact of influenza A on human plasmacytoid dendritic cells (pDC) gene expression
!Series_geo_accession = GSE68849
!Series_status = Public on Feb 01 2016
!Series_submission_date = May 13 2015
!Series_last_update_date = Aug 13 2018
!Series_pubmed_id = 26826244
!Series_summary = Over 2600 genes were differentially expressed in pDCs exposed to influenza A compared with the controls (no virus). Multiple functional group clusters of genes were impacted, including those involving antiviral responses and also metabolism, such as the glycolysis, oxidative phosphorylation, glucose metabolic process, and positive regulation of macromolecule biosynthesis clusters.
!Series_overall_design = Blood pDCs were purified from 5 human donors, exposed ex vivo for 8 h to either influenza A or no virus, harvested, and RNA was extracted for subsequent DNA microarray analysis.
!Series_type = Expression profiling by array
!Series_contributor = Gagan,,Bajwa
!Series_contributor = Michelle,A,Gill
!Series_sample_id = GSM1684095
!Series_sample_id = GSM1684096
!Series_sample_id = GSM1684097
!Series_sample_id = GSM1684098
!Series_sample_id = GSM1684099
!Series_sample_id = GSM1684100
!Series_sample_id = GSM1684101
!Series_sample_id = GSM1684102
!Series_sample_id = GSM1684103
!Series_sample_id = GSM1684104
!Series_contact_name = Michelle,Ann,Gill
!Series_contact_department = Pediatrics
!Series_contact_institute = University of Texas Southwestern Medical Center
!Series_contact_address = 5323 Harry Hines Blvd
!Series_contact_city = Dallas
!Series_contact_zip/postal_code = 75390
!Series_contact_country = USA
!Series_supplementary_file = ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE68nnn/GSE68849/suppl/GSE68849_RAW.tar
!Series_supplementary_file = ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE68nnn/GSE68849/suppl/GSE68849_non-normalized.txt.gz
!Series_platform_id = GPL10558
!Series_platform_taxid = 9606
!Series_sample_taxid = 9606
!Series_relation = BioProject: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA283884
^PLATFORM = GPL10558
!Platform_title = Illumina HumanHT-12 V4.0 expression beadchip
!Platform_geo_accession = GPL10558
!Platform_status = Public on Jun 17 2010
!Platform_submission_date = Jun 17 2010
!Platform_last_update_date = Mar 04 2020
!Platform_technology = oligonucleotide beads
!Platform_distribution = commercial
!Platform_organism = Homo sapiens
!Platform_taxid = 9606
!Platform_data_row_count = 48107
#ID = Unique identifier for the probe (across all products and species)
#Species = 
#Source = Transcript sequence source name
#Search_Key = Internal id useful for custom design array
#Transcript = Internal transcript id
#ILMN_Gene = Internal gene symbol
#Source_Reference_ID = Id in the source database
#RefSeq_ID = Refseq id
#Unigene_ID = Unigene id
#Entrez_Gene_ID = Entrez gene id
#GI = Genbank id
#Accession = Genbank accession number
#Symbol = Gene symbol from the source database
#Protein_Product = Genbank protein accession number
#Probe_Id = 
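
The prefix convention described above can be sketched as a small parser. This is a minimal illustration, not an official reader: the function name `parse_soft_header` and the dictionary layout are choices made for this example, and the input is a shortened copy of the excerpt above. Lines without one of the three prefixes (for example data table rows) are ignored.

```python
def parse_soft_header(lines):
    """Group SOFT lines into entities with attributes and column descriptions."""
    entities = []
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        marker, body = line[0], line[1:]
        # Split on the first "=" only, so values containing "=" stay intact
        key, _, value = body.partition("=")
        key, value = key.strip(), value.strip()
        if marker == "^":            # entity line, e.g. ^SERIES = GSE68849
            current = {"type": key, "id": value, "attributes": {}, "columns": {}}
            entities.append(current)
        elif current is None:
            continue                 # nothing to attach the line to yet
        elif marker == "!":          # attribute; the same key may repeat
            current["attributes"].setdefault(key, []).append(value)
        elif marker == "#":          # description of a data table column
            current["columns"][key] = value
    return entities


excerpt = [
    "^SERIES = GSE68849",
    "!Series_geo_accession = GSE68849",
    "!Series_sample_id = GSM1684095",
    "!Series_sample_id = GSM1684096",
    "^PLATFORM = GPL10558",
    "!Platform_organism = Homo sapiens",
    "#ID = Unique identifier for the probe (across all products and species)",
    "#Species = ",
]
entities = parse_soft_header(excerpt)
for entity in entities:
    print(entity["type"], entity["id"], sorted(entity["attributes"]))
```

Attributes are stored as lists because keys such as Series_sample_id occur once per sample.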

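2. Probe information

This block starts at !platform_table_begin and ends at !platform_table_end. It holds the probe annotation: probe sequence, probe ID, the corresponding gene, and GO annotation.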
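
Pulling the table out from between the two markers could look like the sketch below. It assumes the rows between the markers are tab-delimited with a header row first. The function name `read_soft_table` and the two demo rows (`PROBE_A`, `GENE1`, and so on) are made up for illustration and are not real GPL10558 content.

```python
import csv


def read_soft_table(lines, kind="platform"):
    """Return the rows between !<kind>_table_begin and !<kind>_table_end."""
    begin, end = f"!{kind}_table_begin", f"!{kind}_table_end"
    inside = False
    block = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.lower() == begin:
            inside = True
        elif line.lower() == end:
            break
        elif inside:
            block.append(line)
    # The first collected line is the header; each row becomes a dict
    return list(csv.DictReader(block, delimiter="\t", quoting=csv.QUOTE_NONE))


demo = [
    "^PLATFORM = GPL10558",
    "!platform_table_begin",
    "ID\tSymbol",            # hypothetical two-column header
    "PROBE_A\tGENE1",        # hypothetical rows
    "PROBE_B\tGENE2",
    "!platform_table_end",
]
rows = read_soft_table(demo)
print(rows)
```

Calling it with `kind="sample"` would read a block delimited by !sample_table_begin and !sample_table_end in the same way.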

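3. Sample (GSM) information

Each sample block starts with ^SAMPLE = GSMXXXXX and holds the annotation for that sample, which is richer than what the web page displays.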

!Sample_title = Donor 1 - No virus control   - 8h
!Sample_geo_accession = GSM1684095
!Sample_status = Public on Feb 01 2016
!Sample_submission_date = May 13 2015
!Sample_last_update_date = Feb 01 2016
!Sample_type = RNA
!Sample_channel_count = 1
!Sample_source_name_ch1 = Blood pDCs
!Sample_organism_ch1 = Homo sapiens
!Sample_taxid_ch1 = 9606
!Sample_data_row_count = 47321
#ID_REF = 
#VALUE = Quantile normalized
#5455178010_A.Avg_NBEADS = 
#5455178010_A.BEAD_STDERR = 
#5455178010_A.Detection Pval = 

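For this sample you can also look at the array values, which sit between !sample_table_begin and !sample_table_end. ID_REF is the probe ID and corresponds to the ID column of the platform annotation above. One gene is sometimes covered by several probes, in which case you need to take the mean or the maximum, and this ID is what the matching is based on. Because this GSE uses the Illumina BeadArray platform, where beads are randomly distributed, each probe also carries bead-level statistics: the number of beads (Avg_NBEADS), the bead standard error (BEAD_STDERR), and a detection p-value. These values match what the GSM web page shows, for example https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1684095. The authors state that the VALUE column is quantile normalized.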

ID_REF VALUE 5455178010_A.Avg_NBEADS 5455178010_A.BEAD_STDERR 5455178010_A.Detection Pval
ILMN_1762337 111.824 24 5.172565 0.206494
ILMN_2055271 126.907 21 8.396618 0.038961
ILMN_1736007 91.20126 17 4.822516 0.781818
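
Collapsing several probes onto one gene by mean or maximum, as mentioned above, can be sketched as follows. The three values are the VALUE column from the rows above; the probe-to-gene mapping (`GENE_X`, `GENE_Y`) is invented purely for illustration, since the real mapping comes from the platform table. The function name `collapse_probes` is also an assumption of this example.

```python
from collections import defaultdict


def collapse_probes(values, probe_to_gene, how="mean"):
    """Summarise probe-level values per gene using the mean or the maximum."""
    grouped = defaultdict(list)
    for probe, value in values.items():
        gene = probe_to_gene.get(probe)
        if gene:                      # skip probes without a gene annotation
            grouped[gene].append(value)
    if how == "max":
        return {gene: max(xs) for gene, xs in grouped.items()}
    return {gene: sum(xs) / len(xs) for gene, xs in grouped.items()}


# VALUE column of GSM1684095 for the three probes shown above
values = {"ILMN_1762337": 111.824, "ILMN_2055271": 126.907, "ILMN_1736007": 91.20126}
# Hypothetical mapping, NOT the real annotation of these probes
probe_to_gene = {"ILMN_1762337": "GENE_X", "ILMN_2055271": "GENE_X", "ILMN_1736007": "GENE_Y"}

by_mean = collapse_probes(values, probe_to_gene, how="mean")
by_max = collapse_probes(values, probe_to_gene, how="max")
print(by_mean)
print(by_max)
```

Which summary to use is an analysis decision; the maximum keeps the strongest probe, while the mean smooths over probes that disagree.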

MINiML formatted family file

The MINiML family file has the same content as the SOFT file above; it is simply stored as XML.

Series Matrix File

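In most cases the Series Matrix File is the matrix the authors have already prepared. Its header also carries the metadata, and every metadata line starts with !. The table itself holds the signal value of each probe in each sample.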

ID_REF GSM1684095 GSM1684096 GSM1684097
ILMN_1343291 23599.32 22303.32 25980.17
ILMN_1651228 9156.968 8897.159 9326.865
ILMN_1651229 253.5435 253.7028 230.3505
ILMN_1651230 96.60646 101.3388 90.95158
ILMN_1651232 112.7772 106.5139 97.73111
ILMN_1762337 111.824 154.114 108.3252
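
Reading such a file could be sketched as below, using only the standard library. The sketch assumes the layout of the downloadable file: tab-delimited, string fields in double quotes, and the value table enclosed by !series_matrix_table_begin and !series_matrix_table_end. The function name `read_series_matrix` is made up for this example, and the demo lines reuse values from the excerpt above.

```python
def read_series_matrix(lines):
    """Split a series matrix file into metadata, sample IDs and probe values."""
    meta, samples, matrix = {}, None, {}
    in_table = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "!series_matrix_table_begin":
            in_table = True
        elif line == "!series_matrix_table_end":
            in_table = False
        elif in_table:
            fields = [f.strip('"') for f in line.split("\t")]
            if samples is None:       # first table row: ID_REF plus GSM ids
                samples = fields[1:]
            else:                     # empty cells become None
                matrix[fields[0]] = [float(x) if x else None for x in fields[1:]]
        elif line.startswith("!"):    # metadata line: key, then tab-separated values
            key, _, rest = line[1:].partition("\t")
            meta.setdefault(key, []).extend(v.strip('"') for v in rest.split("\t"))
    return meta, samples, matrix


demo = [
    '!Series_geo_accession\t"GSE68849"',
    "!series_matrix_table_begin",
    '"ID_REF"\t"GSM1684095"\t"GSM1684096"\t"GSM1684097"',
    '"ILMN_1343291"\t23599.32\t22303.32\t25980.17',
    '"ILMN_1762337"\t111.824\t154.114\t108.3252',
    "!series_matrix_table_end",
]
meta, samples, matrix = read_series_matrix(demo)
print(samples)
print(matrix["ILMN_1762337"])
```

For a real file the same function can be fed the lines of the decompressed text, for example from `gzip.open(path, "rt")`.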

Supplementary file

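For this dataset the supplementary files are the non-normalized expression matrix and its annotation. For most datasets, however, this section holds the raw array data, such as the original image-level files, usually packed in a RAW.tar file. Some authors also put tables of clinical information here.

References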

https://www.ncbi.nlm.nih.gov/geo/info/soft.html

####################################################################

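#All rights reserved. Please notify the author before reposting. Copyright belongs to the author, and any infringement found will be pursued legally.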

#Author: Jason

#####################################################################