Help with Investigating Gene Sets

Compute Overlaps

When gene sets share genes, examining how they overlap can highlight common processes, pathways, and underlying biological themes. This tool computes the overlap between a user-provided gene set and the gene sets in as many MSigDB collections as you choose, and estimates the statistical significance of each overlap.

Because of the characteristics of the hypergeometric distribution, there is a limit to how large the user-provided gene set can be while still producing meaningful significance estimates. At most 2940 genes are allowed; larger gene sets are rejected.

Enter a list of gene identifiers in the box provided. A pull down menu below the box will allow you to specify how you are identifying genes. Overlaps are computed using HUGO gene symbols and any required conversion is done automatically by the tool. Gene-level Ensembl human and mouse IDs are accepted, but remove any version suffixes from the identifiers (transcript-level IDs are not accepted).
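The clean-up described above (removing Ensembl version suffixes and respecting the 2940-gene limit) can be done before pasting a list into the box. The following is a minimal Python sketch, not part of the tool itself. The function name prepare_gene_ids is hypothetical, the pattern assumes that gene-level IDs look like ENSG... (human) or ENSMUSG... (mouse) with an optional ".<number>" version suffix, and dropping duplicates is a convenience added here rather than something the tool is known to require.

```python
import re

MAX_GENES = 2940  # upper limit stated in the text above

# Assumed shape of gene-level Ensembl IDs: ENSG... (human) or ENSMUSG... (mouse),
# optionally followed by a ".<version>" suffix.
_ENSEMBL_GENE = re.compile(r"^(ENS(?:MUS)?G\d+)(?:\.\d+)?$")


def prepare_gene_ids(raw_ids):
    """Strip Ensembl version suffixes, drop blanks and duplicates, check the size limit."""
    cleaned = []
    seen = set()
    for raw in raw_ids:
        gene_id = raw.strip()
        if not gene_id:
            continue  # skip empty lines
        match = _ENSEMBL_GENE.match(gene_id)
        if match:
            gene_id = match.group(1)  # keep only the stable ID, without the version
        if gene_id not in seen:
            seen.add(gene_id)
            cleaned.append(gene_id)
    if len(cleaned) > MAX_GENES:
        raise ValueError(
            f"{len(cleaned)} genes exceeds the limit of {MAX_GENES} genes"
        )
    return cleaned


if __name__ == "__main__":
    ids = ["ENSG00000141510.17", "ENSMUSG00000059552.14", "TP53", "", "TP53"]
    print("\n".join(prepare_gene_ids(ids)))
```

Identifiers that do not match the assumed Ensembl pattern, such as gene symbols, are passed through unchanged, so symbols that happen to contain a period are not truncated.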

Click on the "compute overlaps" button to display the results, including

  • Statistics:
    • # overlaps shown lists the number of overlapping gene sets displayed in the report
      By default, the report displays the 10 gene sets in the collection that best overlap with your gene set. If you compute overlaps from the Investigate Gene Sets page, you can choose the number of overlapping gene sets to display in the report.
    • # gene sets in collection lists the total number of gene sets being analyzed
    • # genes in comparison lists the number of genes in your gene set
    • # genes in collection lists the number of unique genes in the gene sets being analyzed
  • Descriptions of the overlapping gene sets, including
    • Link to the gene set page
    • Number of genes in the gene set
    • Description of the gene set
    • Number of genes in the overlap between this gene set and your gene set
    • P value from the hypergeometric distribution for (k-1, K, N - K, n) where
      k is the number of genes in the intersection of the query set with a set from MSigDB
      K is the number of genes in the set from MSigDB
      N is the total number of genes in the gene universe (all known human gene symbols)
      n is the number of genes in the query set
      You can read the Wikipedia article on the hypergeometric distribution for more information on how p-values are determined.
    • FDR q-value. This is the false discovery rate analog of the hypergeometric p-value, after correction for multiple hypothesis testing according to Benjamini and Hochberg.
      You can read the Wikipedia article on the false discovery rate for more information on how q-values are determined.
    • Color bar shading from light green to black, where lighter colors indicate more significant q-values (< 0.05) and black indicates less significant q-values (≥ 0.05).
  • Overlap matrix showing the genes in the overlapping gene sets
    • Rows list the genes in your gene set, with gene descriptions and links to gene annotations
    • Columns list the overlapping gene sets, with links to the gene set pages
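The p-value and FDR q-value listed in the report can be illustrated with a short Python sketch. This is not the tool's own code. It assumes that the p-value is the upper-tail probability P(X ≥ k) of the hypergeometric distribution, which is what the argument tuple (k-1, K, N - K, n) yields when read as an upper-tail call in the style of R's phyper, and that q-values follow the standard Benjamini-Hochberg step-up adjustment. The function names and the toy numbers are hypothetical.

```python
from math import comb


def overlap_p_value(k, K, N, n):
    """Upper-tail hypergeometric probability P(X >= k).

    k: genes in the overlap, K: genes in the MSigDB set,
    N: genes in the universe, n: genes in the query set.
    """
    upper = min(K, n)
    # Count the ways to draw n genes so that i of them fall in the
    # K-gene set, summed over i = k .. upper.
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the minimum
    # of p * m / rank over its own rank and all higher ranks.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


if __name__ == "__main__":
    # Toy example: universe of 10 genes, a 4-gene set, a 3-gene query,
    # and 2 genes in common. P(X >= 2) = 40/120 = 1/3.
    print(overlap_p_value(k=2, K=4, N=10, n=3))
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

The exact-integer sum is easy to check by hand but slow for large inputs. If SciPy is available, scipy.stats.hypergeom.sf(k - 1, N, K, n) should give the same upper-tail probability.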

Compendia expression profiles

You can display a heat map of the expression levels of the genes in your gene list across the samples of any one of the following compendia of expression data:

  • Human tissue compendium (Novartis). Gene expression profiles from the Novartis normal tissue compendium, as published in Su, A. I., Wiltshire, T., Batalov, S., Lapp, H., Ching, K. A., Block, D., Zhang, J., Soden, R., Hayakawa, M., Kreiman, G., et al. (2004) Proc. Natl. Acad. Sci. USA 101, 6062-6067.
  • NCI-60 cell lines (National Cancer Institute). Gene expression profiles from the NCI 60 data set downloaded from the Developmental Therapeutics Program web site (http://dtp.nci.nih.gov/mtargets/download.html). No preprocessing was done other than collapsing probe IDs to gene symbols.

Enter a list of gene identifiers in the box provided. Any required conversion is done automatically by the tool. Gene-level Ensembl human and mouse IDs are accepted, but remove any version suffixes from the identifiers (transcript-level IDs are not accepted).

Choose one of the available compendia and click on "display expression profile". The resulting heat map includes dendrograms that cluster the expression data by gene and by sample. Genes are identified by probe identifier, gene symbol, description, and gene family.

Gene families

A gene family describes any collection of proteins that share a common feature such as homology or biochemical activity. Available categories and links to the relevant source publications in PubMed:

Enter a list of gene identifiers in the box provided. A pull down menu below the box will allow you to specify how you are identifying genes. Any required conversion is done automatically by the tool. Click on the "show gene families" button to retrieve an overview of your gene list with its members categorized into a small number of carefully chosen gene families.