Frequently Asked Questions


1. What is the difference between Seed regions and Query regions?
2. Should I set my query regions equal to seed regions?
3. How should I enter genomic regions?
4. What happens if associated regions overlap?
5. Should I upload my own gene list?
6. How should I define a gene list?
7. GRAIL did not accept my SNP?
8. GRAIL did not accept my region defined as a gene list?
9. GRAIL did not accept my defined region?
10. GRAIL did not accept my email address?
11. GRAIL did not accept my job description?
12. How long does GRAIL take?
13. What is the GRAIL output?
14. How can I export the data to some other format?
15. Is a detailed description of the method available?
16. Isn't there another bioinformatics algorithm named GRAIL?
17. I have additional questions - who can I contact?




Major GRAIL site updates


March 2009.
GRAIL text database updated to remove genes not actually within the genome in order to
(1) improve efficiency
(2) remove statistical inflation associated with GRAIL p-values

Text database is current to only December 2006.

The GRAIL site is updated to allow users to input a list of gene IDs to define a region.

August 2009.
GRAIL site is updated to allow users to choose from multiple functional databases:
1. A text database current to March 2009
2. A text database current to December 2006 (original)
3. A Gene Ontology based database

Also users may choose to upload their own gene file - containing one
Entrez ID per line. This has the effect of restricting the analysis
to a subset of genes of interest.

May 2010.
GRAIL site is updated to allow users to choose from additional functional databases:
1. Three text databases, current to December 2006, March 2009, and May 2010
2. A Gene Ontology based database
3. A database based on a Tissue Expression Atlas from Novartis

Also, an option to correct for biases introduced by gene size is added.




1. What is the difference between Seed regions and Query regions?


Seed regions are typically higher confidence associations, such as associations at the top of an association study with highly significant association scores, or a list of validated associations.


Query regions are associations that are being tested against the seed regions, typically regions with lower confidence of associations.



2. Should I set my query regions equal to seed regions?


If you set the query regions equal to the seed regions, then GRAIL tests the set of genomic regions to see if there are genes within them that are highly related to genes in other independent regions. This is an effective strategy if there are no validated hits, or if the number of validated hits is small and an adequate seed set cannot be constructed. This is also an appropriate strategy if the user is searching for pathways outside of the ones suggested by validated hits.



3. How should I enter genomic regions?


A genomic region can be defined in three separate ways: (1) by a tagging SNP, (2) as an explicitly defined genomic region, or (3) as a list of genes. A list of associated regions can mix all three. Regions can be uploaded in a file or entered in the dialog box. Each line should define a single region.


1. Regions can be listed as SNPs in HapMap (phase 2). For each region, list only the SNP name on the line as an rs number. In this case, the region is automatically defined using linkage disequilibrium (LD) properties. Specifically, GRAIL finds the furthest neighboring SNPs in the 3’ and 5’ direction in LD (r^2 > 0.5). It then proceeds outwards in each direction to the nearest recombination hotspot (Myers et al Science 2006). If no genes are in that region, the region is expanded 250 kb in either direction. All mapping is done in HG17 and HapMap Release #21.


2. Regions can be explicitly defined. In this case, indicate the Human Genome Assembly (17 or 18) in which your regions are defined. Then on each line enter a region as four fields, in order and separated by spaces: a unique word identifier, the chromosome that the region is on, the start position (base pairs), and the end position (base pairs).


3. Regions can be defined as a gene list. In this case, on each line enter a unique word identifier, followed by the term GID. Then list each gene by its Entrez ID, separated by spaces.
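
The three line formats above can be told apart mechanically. The following is a minimal sketch in Python, not GRAIL's own parser: the function name classify_region_line and the returned field names are hypothetical, and rs123456 in the usage note is only a placeholder for the format.

```python
import re

# Patterns for the tokens described in the FAQ (ASCII digits only).
RS_NUMBER = re.compile(r"rs[0-9]+")
DIGITS = re.compile(r"[0-9]+")


def classify_region_line(line):
    """Classify one input line as a SNP, an explicit region, or a gene list.

    Hypothetical helper: returns (kind, details) or raises ValueError.
    """
    fields = line.split()
    if not fields:
        raise ValueError("empty line")

    # Format 1: only an rs number on the line.
    if len(fields) == 1 and RS_NUMBER.fullmatch(fields[0]):
        return ("snp", {"rsid": fields[0]})

    # Format 3: identifier, the literal term GID, then Entrez IDs.
    if len(fields) >= 3 and fields[1] == "GID":
        gene_ids = fields[2:]
        if not all(DIGITS.fullmatch(g) for g in gene_ids):
            raise ValueError("genes must be given as numeric Entrez IDs")
        return ("genes", {"name": fields[0],
                          "entrez_ids": [int(g) for g in gene_ids]})

    # Format 2: identifier, chromosome, start, end.
    if (len(fields) == 4 and DIGITS.fullmatch(fields[2])
            and DIGITS.fullmatch(fields[3])):
        return ("region", {"name": fields[0], "chrom": fields[1],
                           "start": int(fields[2]), "end": int(fields[3])})

    raise ValueError(f"unrecognized region line: {line!r}")
```

For instance, classify_region_line("rs123456") is classified as a SNP, while classify_region_line("RegionA GID 26191") is classified as a gene list. Checking your file this way before upload can catch formatting problems early, though GRAIL itself may apply further checks that are not modeled here.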



4. What happens if associated regions overlap?


We encourage users to enter distinct non-overlapping associated regions. This allows the most facile interpretation of the data. However, the method is designed to be robust to overlapping regions.


In practice, if two seed regions share a common gene, then they are merged into a single seed region prior to conducting statistical analyses. This prevents violation of GRAIL's independence assumption.
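
The merging step can be pictured with a small Python sketch. This is an illustration only, not GRAIL's implementation: the function name merge_regions_sharing_genes is hypothetical, regions are assumed to be given as sets of gene IDs, and the sketch assumes merging is transitive (if A shares a gene with B, and B with D, all three are merged).

```python
def merge_regions_sharing_genes(regions):
    """Merge regions that share at least one gene.

    regions: dict mapping region name -> set of gene IDs.
    Returns a list of (sorted region names, union of their genes).
    Assumption: merging is transitive, implemented with union-find.
    """
    parent = {name: name for name in regions}

    def find(name):
        # Follow parent links to the group representative,
        # shortening the path along the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    first_owner = {}  # gene -> first region seen containing it
    for name, genes in regions.items():
        for gene in genes:
            if gene in first_owner:
                root_a, root_b = find(first_owner[gene]), find(name)
                if root_a != root_b:
                    parent[root_b] = root_a
            else:
                first_owner[gene] = name

    merged = {}
    for name, genes in regions.items():
        names, all_genes = merged.setdefault(find(name), ([], set()))
        names.append(name)
        all_genes.update(genes)
    return [(sorted(names), genes) for names, genes in merged.values()]
```

With regions A = {1, 2}, B = {2, 3} and C = {9}, regions A and B are merged into one seed region containing genes 1, 2 and 3, while C stays on its own.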


Query regions are scored according to the best-scoring gene within them. So if query regions overlap, the same gene could drive the significance scores of every associated region that contains it.



5. Should I upload my own gene list?


If your data come from a screen in which not all genes in the genome were queried, then there is an advantage to using a gene list that represents only the genes that were queried. For example, if an assay queried only a handful of chromosomes, then a restricted list of genes is appropriate.



6. How should I define a gene list?


The gene list is a list of Entrez IDs (a string of base 10 digits), where each individual line has a unique ID. Entrez IDs can be found at the Entrez web site.
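
Before uploading, you may want to check your gene file against the rules above. The following Python sketch does so under stated assumptions: the function name read_gene_list is hypothetical, blank lines are tolerated, and a repeated ID is treated as an error (GRAIL's own handling of these cases is not documented here).

```python
import re

# An Entrez ID is a string of base 10 digits.
ENTREZ_ID = re.compile(r"[0-9]+")


def read_gene_list(lines):
    """Return the Entrez IDs found in an iterable of lines, as integers.

    Raises ValueError on a non-numeric entry or a repeated ID.
    Assumption: blank lines are skipped rather than rejected.
    """
    seen = set()
    gene_ids = []
    for lineno, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue
        if not ENTREZ_ID.fullmatch(text):
            raise ValueError(
                f"line {lineno}: not a numeric Entrez ID: {text!r}")
        gene_id = int(text)
        if gene_id in seen:
            raise ValueError(f"line {lineno}: duplicate Entrez ID {gene_id}")
        seen.add(gene_id)
        gene_ids.append(gene_id)
    return gene_ids
```

The function accepts any iterable of lines, so an open file handle can be passed directly, for example read_gene_list(open("my_genes.txt")) where my_genes.txt is a placeholder file name.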



7. GRAIL did not accept my SNP?


GRAIL only accepts SNPs in HapMap phase II. SNPs must be listed as rs numbers. This is necessary to define the surrounding genomic region. See the HapMap website for more information. If your SNP of interest is not in the HapMap, please enter a nearby SNP that is highly correlated with the given SNP. Alternatively, you can explicitly define the genomic region as described above.



8. GRAIL did not accept my region defined as a gene list?


GRAIL accepts regions defined as a list of genes. The first field is a single word which is used as an identifier. The second field must be 'GID' so that GRAIL knows to look for gene IDs subsequently. After that, genes must be listed, separated by spaces, on the same line. Each gene must be given by its human Entrez Gene ID, and NOT by its HUGO gene name or other identifiers. For example, for PTPN22 enter 26191.



9. GRAIL did not accept my defined region?


GRAIL accepts regions defined by genomic coordinates. The first field is a single word which is used as an identifier. The second field defines the chromosome: 1-22, X, or Y. The third and fourth fields are the start and end positions in base pairs. The third field must be a lower value than the fourth field. For example, an appropriate entry for a region on Chromosome 10 from 10 Mb to 20 Mb is:

Region1 10 10000000 20000000
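
The rules above can be checked with a few lines of Python. This is a sketch under assumptions, not GRAIL's own validation: the function name validate_region_line is hypothetical, and chromosome names are assumed to be written exactly as 1-22, X or Y (no "chr" prefix, uppercase letters).

```python
# Chromosomes accepted according to the FAQ: 1-22, X, Y.
VALID_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}


def validate_region_line(line):
    """Check one explicitly defined region; return (name, chrom, start, end).

    Raises ValueError describing the first rule that the line breaks.
    """
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(
            "expected four fields: identifier, chromosome, start, end")
    name, chrom, start_text, end_text = fields
    if chrom not in VALID_CHROMOSOMES:
        raise ValueError(f"chromosome must be 1-22, X, or Y, got {chrom!r}")
    for text in (start_text, end_text):
        # Positions are plain base pair counts written in ASCII digits.
        if not (text.isascii() and text.isdigit()):
            raise ValueError(f"position must be a whole number, got {text!r}")
    start, end = int(start_text), int(end_text)
    if start >= end:
        raise ValueError("start position must be lower than end position")
    return name, chrom, start, end
```

Applied to the example entry above, the function returns the four parsed fields; swapping the start and end positions raises an error.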



10. GRAIL did not accept my email address?


GRAIL permits the use of only email addresses that are appropriately formatted.



11. GRAIL did not accept my job description?


GRAIL requires a two-word job description so that your jobs are as identifiable to you as possible.



12. How long does GRAIL take?


A typical analysis of ~50 SNPs should be complete in <30 minutes. Depending on the load of the cluster, and the size of the Seed and Query Set, analyses can take several hours.



13. What is the GRAIL output?


The GRAIL output link has several components. 


At the very top is a link to details about the Seed and Query Regions - these links take you to a list of the regions. Listed along with the region identifier is the chromosomal position, the genomic range, and the genes in that region. Genomic Position for SNPs is always in HG17, while for explicitly defined genomic regions the position is in either HG17 or HG18 depending on the user's input.


The next set of outputs is the assignment of significance scores to each region. For each genomic region, a p-value is assigned by GRAIL based on the best candidate gene. The candidate gene, that is, the gene with the greatest number of relationships to other associated genes in independent regions, is listed in the third column for each region. If there are multiple equivalent genes, they are all listed. The p-value, which is based on the candidate gene after multiple hypothesis correction for the number of genes in the region, is also listed.
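
To see why a region-level p-value can differ from the p-value of its best gene, consider the following Python sketch. It is an illustration only: this FAQ does not state the exact correction GRAIL applies, so the sketch assumes a Sidak-style adjustment, and the function name region_p_value is hypothetical. Please consult the manuscript for the actual procedure.

```python
def region_p_value(gene_p_values):
    """Illustrative region score from the per-gene p-values in a region.

    Assumption (not confirmed by the FAQ): take the best (smallest)
    gene p-value and adjust it for the number of genes tested using
    a Sidak-style correction, 1 - (1 - p_best) ** n.
    """
    n_genes = len(gene_p_values)
    if n_genes == 0:
        raise ValueError("a region must contain at least one gene")
    best = min(gene_p_values)
    return 1.0 - (1.0 - best) ** n_genes
```

Under this assumed correction, a region with a single gene keeps that gene's p-value unchanged, while the same best gene in a region with many genes receives a larger (less significant) region p-value.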


The next set of outputs is a set of descriptive keywords in order of their informativeness.


Finally, individual genes are listed (only those with p<0.2). Their p-values may differ from the p-values at the top since they are not corrected for multiple hypothesis testing. On the right are listed the other associated genes to which each gene was identified as similar. After each of these genes, the rank similarity is listed in parentheses, where the rank 1 gene is the gene itself.


14. How can I export the data to some other format?


The output data file is in tab-delimited format, and can be easily copied and pasted into Excel or other software.
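
If you prefer to convert the file programmatically, a short Python sketch using the standard csv module is shown here. The function name tab_delimited_to_csv is hypothetical, and the sketch makes no assumption about which columns the GRAIL output contains; it only changes the delimiter.

```python
import csv
import io


def tab_delimited_to_csv(lines):
    """Convert an iterable of tab-delimited lines to a CSV string.

    Empty lines are dropped; fields that contain commas are quoted
    by the csv module so that they survive the conversion.
    """
    rows = csv.reader(lines, delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(row for row in rows if row)
    return out.getvalue()
```

An open file handle can be passed as the lines argument, for example tab_delimited_to_csv(open("grail_output.txt", newline="")), where grail_output.txt is a placeholder name for your saved output file.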



15. Is a detailed description of the method available?


The manuscript describing GRAIL is now available online at PLoS Genetics.


16. Isn't there another bioinformatics algorithm named GRAIL?


The GRAIL (Gene Recognition and Assembly Internet Link) program for exon prediction was originally described in 1991 by Uberbacher and Mural (Proc Natl Acad Sci USA). For years, along with GENSCAN (Burge & Karlin, J. Mol Biol. 1997), it was widely cited and utilized for defining coding regions within the genome. Until 2005 it was developed and supported by the Oak Ridge National Laboratory - since then it has been unavailable. We apologize for any confusion this might have caused.



17. I have additional questions - who can I contact?


Please email the GRAIL team at GRAIL at broad . mit . edu