Difference between revisions of "RNA-Seq Data and Ensembl CHIP files"

From GeneSetEnrichmentAnalysisWiki
m (Note on transcript-level quantifications)
(Note on removal of transcript-level CHIP files)
Line 8: Line 8:
 
A GSEA analysis requires three different types of input data: a gene expression dataset in GCT format, the corresponding sample annotations in CLS format, and a collection of gene sets in GMT format. GSEA is typically used with gene sets from the Molecular Signatures Database (MSigDB), which consist of HUGO human gene symbols. However, gene expression data files may use other types of identifiers, depending on how the data were produced. To proceed with the analysis, GSEA converts the identifiers found in the data file to match the human symbols used in the gene set files. The conversion is performed using a CHIP file that provides the mapping between the two types of identifiers. Over the years, we have provided CHIP files for all major microarray platforms. For example, we have CHIP files that list the mappings between Affymetrix probe set IDs and human gene symbols.
 
  
In RNA-Seq, gene expression is quantified by counting the number of sequencing reads that aligned to a genomic range, according to a reference genome assembly or transcript annotations. The majority of tools use [http://www.ensembl.org/info/about/index.html Ensembl] reference annotations for this purpose. To facilitate GSEA analysis of RNA-Seq data, we now also provide CHIP files to convert human and mouse Ensembl IDs to HUGO gene symbols. Ensembl annotation uses a [http://www.ensembl.org/info/genome/stable_ids/index.html system of stable IDs] that have prefixes based on the species name plus the feature type, followed by a series of digits and a version, e.g., <code>ENSG00000139618.1</code>. The new GSEA Ensembl CHIP files provide mappings for human gene and transcript identifiers (i.e., Ensembl IDs with prefixes ENSG and ENST), and for mouse gene and transcript identifiers (i.e., Ensembl IDs with prefixes ENSMUSG and ENSMUST). Although transcript-level CHIP files are provided, it is not recommended to perform GSEA on transcript-level quantification data as GSEA lacks the functions necessary to properly sum transcript counts to gene level.
+
In RNA-Seq, gene expression is quantified by counting the sequencing reads that align to a genomic range, as defined by a reference genome assembly or transcript annotations. The majority of tools use [http://www.ensembl.org/info/about/index.html Ensembl] reference annotations for this purpose. To facilitate GSEA analysis of RNA-Seq data, we now also provide CHIP files to convert human, mouse, and rat Ensembl IDs to HUGO gene symbols. Ensembl annotation uses a [http://www.ensembl.org/info/genome/stable_ids/index.html system of stable IDs] that have a prefix based on the species name plus the feature type, followed by a series of digits and a version, e.g., <code>ENSG00000139618.1</code>. The new GSEA Ensembl CHIP files provide mappings for human, mouse, and rat gene identifiers (i.e., Ensembl IDs with the prefixes ENSG, ENSMUSG, and ENSRNOG).
  
 
To run GSEA with gene expression data specified with Ensembl identifiers:
 
Line 16: Line 16:
 
# Choose gene sets to test - we usually recommend starting with the Hallmarks collection.
 
 
# Choose the CHIP file that matches the identifiers in the GCT file:
 
#* <code>ENSEMBL_human_gene.chip</code> => Ensembl ID prefix ENSG
+
#* <code>Human_ENSEMBL_Gene_ID_MSigDB.vX.chip</code> => Ensembl ID prefix ENSG
#* <code>ENSEMBL_human_transcript.chip</code> => Ensembl ID prefix ENST
+
#* <code>Mouse_ENSEMBL_Gene_ID_MSigDB.vX.chip</code> => Ensembl ID prefix ENSMUSG
#* <code>ENSEMBL_mouse_gene.chip</code> => Ensembl ID prefix ENSMUSG
+
#* <code>Rat_ENSEMBL_Gene_ID_MSigDB.vX.chip</code> => Ensembl ID prefix ENSRNOG
#* <code>ENSEMBL_mouse_transcript.chip</code> => Ensembl ID prefix ENSMUST
 
  
We have also added the gene-level Ensembl IDs to the website for use with the Investigate Gene Sets tools such as Compute Overlaps.  As noted above, it is necessary to remove the version suffix from any supplied IDs. At this time, transcript-level identifiers are not accepted.
+
We have also added the gene-level Ensembl IDs to the website for use with the Investigate Gene Sets tools such as Compute Overlaps.  As noted above, it is necessary to remove the version suffix from any supplied IDs.
 +
<br>
 +
<b>Note:</b> While GSEA can accept transcript-level quantifications directly and sum them to gene level, such quantifications are typically not properly normalized for between-sample comparisons. For this reason, the GSEA-MSigDB team no longer provides transcript-level CHIP annotations.

Revision as of 15:46, 14 November 2019


A GSEA analysis requires three different types of input data: a gene expression dataset in GCT format, the corresponding sample annotations in CLS format, and a collection of gene sets in GMT format. GSEA is typically used with gene sets from the Molecular Signatures Database (MSigDB), which consist of HUGO human gene symbols. However, gene expression data files may use other types of identifiers, depending on how the data were produced. To proceed with the analysis, GSEA converts the identifiers found in the data file to match the human symbols used in the gene set files. The conversion is performed using a CHIP file that provides the mapping between the two types of identifiers. Over the years, we have provided CHIP files for all major microarray platforms. For example, we have CHIP files that list the mappings between Affymetrix probe set IDs and human gene symbols.
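
As a rough illustration of the identifier conversion described above, the sketch below looks up data-file identifiers in a CHIP-style table. It assumes the table is tab-delimited with `Probe Set ID` and `Gene Symbol` columns; the example rows and the function names `load_chip` and `to_symbols` are hypothetical, and this is not GSEA's own implementation.

```python
import csv
import io

# Hypothetical CHIP-style content; the column layout is an assumption.
CHIP_TEXT = (
    "Probe Set ID\tGene Symbol\tGene Title\n"
    "ENSG00000139618\tBRCA2\tBRCA2 DNA repair associated\n"
    "ENSG00000141510\tTP53\ttumor protein p53\n"
)


def load_chip(handle):
    """Read a tab-delimited CHIP-style table into an {identifier: symbol} dict."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Probe Set ID"]: row["Gene Symbol"] for row in reader}


def to_symbols(ids, chip):
    """Split identifiers into mapped gene symbols and unmapped identifiers."""
    mapped, unmapped = [], []
    for identifier in ids:
        if identifier in chip:
            mapped.append(chip[identifier])
        else:
            unmapped.append(identifier)
    return mapped, unmapped


chip = load_chip(io.StringIO(CHIP_TEXT))
symbols, missing = to_symbols(
    ["ENSG00000139618", "ENSG00000141510", "ENSG99999999999"], chip
)
print(symbols)
print(missing)
```

Identifiers with no entry in the table are returned separately here so that they can be inspected rather than silently dropped.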

In RNA-Seq, gene expression is quantified by counting the sequencing reads that align to a genomic range, as defined by a reference genome assembly or transcript annotations. The majority of tools use Ensembl reference annotations for this purpose. To facilitate GSEA analysis of RNA-Seq data, we now also provide CHIP files to convert human, mouse, and rat Ensembl IDs to HUGO gene symbols. Ensembl annotation uses a system of stable IDs that have a prefix based on the species name plus the feature type, followed by a series of digits and a version, e.g., ENSG00000139618.1. The new GSEA Ensembl CHIP files provide mappings for human, mouse, and rat gene identifiers (i.e., Ensembl IDs with the prefixes ENSG, ENSMUSG, and ENSRNOG).
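
The stable ID layout described above (prefix, digits, optional version) can be sketched with a small parser. This is a minimal illustration that only recognizes the three gene prefixes named in the text; the function name `parse_stable_id` and the regular expression are assumptions made for this example, not part of GSEA or Ensembl tooling.

```python
import re

# Gene-level prefixes named in the text, mapped to their species.
GENE_PREFIXES = {"ENSG": "human", "ENSMUSG": "mouse", "ENSRNOG": "rat"}

# Prefix ending in G (gene), then digits, then an optional ".version".
STABLE_ID = re.compile(r"^(ENS[A-Z]*G)(\d+)(?:\.(\d+))?$")


def parse_stable_id(stable_id):
    """Split a gene-level Ensembl stable ID into prefix, digits, and version."""
    match = STABLE_ID.match(stable_id)
    if match is None:
        raise ValueError(f"not a gene-level Ensembl ID: {stable_id!r}")
    prefix, digits, version = match.groups()
    if prefix not in GENE_PREFIXES:
        raise ValueError(f"unsupported prefix: {prefix!r}")
    return {
        "prefix": prefix,
        "species": GENE_PREFIXES[prefix],
        "digits": digits,
        "version": int(version) if version is not None else None,
    }


print(parse_stable_id("ENSG00000139618.1"))
print(parse_stable_id("ENSMUSG00000017167"))
```

Transcript identifiers such as those starting with ENST do not match the pattern and raise a `ValueError`, which mirrors the gene-only scope of the current CHIP files.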

To run GSEA with gene expression data specified with Ensembl identifiers:

  1. Prepare the GCT gene expression file such that identifiers are in the form of Ensembl IDs, but without the version suffix, e.g., ENSG00000139618.
  2. For RNA-Seq data, you will need to normalize the data, filter out low-count measurements, and perform other preprocessing as needed. Consult your local bioinformatician for help if unsure.
  3. Load the GCT and corresponding CLS files into GSEA.
  4. Choose gene sets to test - we usually recommend starting with the Hallmarks collection.
  5. Choose the CHIP file that matches the identifiers in the GCT file:
    • Human_ENSEMBL_Gene_ID_MSigDB.vX.chip => Ensembl ID prefix ENSG
    • Mouse_ENSEMBL_Gene_ID_MSigDB.vX.chip => Ensembl ID prefix ENSMUSG
    • Rat_ENSEMBL_Gene_ID_MSigDB.vX.chip => Ensembl ID prefix ENSRNOG
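
Steps 1 and 5 above can be sketched as follows. This is a minimal example under stated assumptions: the GCT file is assumed to have three header lines (version, dimensions, column names) followed by tab-delimited rows whose first field is the identifier; the helper names and the sample data are hypothetical; and the `vX` in each file name is kept as the placeholder used in the list above.

```python
# CHIP file names from the list above; "vX" is a placeholder, not a real version.
CHIP_BY_PREFIX = {
    "ENSG": "Human_ENSEMBL_Gene_ID_MSigDB.vX.chip",
    "ENSMUSG": "Mouse_ENSEMBL_Gene_ID_MSigDB.vX.chip",
    "ENSRNOG": "Rat_ENSEMBL_Gene_ID_MSigDB.vX.chip",
}


def strip_version(ensembl_id):
    """Drop the version suffix, e.g. ENSG00000139618.17 -> ENSG00000139618."""
    return ensembl_id.split(".", 1)[0]


def strip_gct_versions(gct_text):
    """Strip version suffixes from the identifier column of GCT-formatted text."""
    lines = gct_text.splitlines()
    cleaned = lines[:3]  # assumed header: version line, dimensions, column names
    for line in lines[3:]:
        fields = line.split("\t")
        fields[0] = strip_version(fields[0])
        cleaned.append("\t".join(fields))
    return "\n".join(cleaned) + "\n"


def choose_chip(ids):
    """Pick the single CHIP file whose prefix matches every identifier."""
    chips = set()
    for identifier in ids:
        for prefix, chip_name in CHIP_BY_PREFIX.items():
            if identifier.startswith(prefix):
                chips.add(chip_name)
                break
        else:
            raise ValueError(f"no CHIP file for identifier {identifier!r}")
    if len(chips) != 1:
        raise ValueError("identifiers are empty or mix several species")
    return chips.pop()


# Hypothetical two-gene, two-sample dataset.
GCT_TEXT = (
    "#1.2\n"
    "2\t2\n"
    "NAME\tDescription\tS1\tS2\n"
    "ENSG00000139618.17\tna\t10.5\t8.2\n"
    "ENSG00000141510.18\tna\t3.1\t4.4\n"
)

cleaned = strip_gct_versions(GCT_TEXT)
gene_ids = [line.split("\t")[0] for line in cleaned.splitlines()[3:]]
print(cleaned)
print(choose_chip(gene_ids))
```

Only the identifier column is modified, so decimal points in the expression values are left untouched.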

We have also added the gene-level Ensembl IDs to the website for use with the Investigate Gene Sets tools such as Compute Overlaps. As noted above, it is necessary to remove the version suffix from any supplied IDs.
Note: While GSEA can accept transcript-level quantifications directly and sum them to gene level, such quantifications are typically not properly normalized for between-sample comparisons. For this reason, the GSEA-MSigDB team no longer provides transcript-level CHIP annotations.