motifADE
A tool for identifying motifs associated with differential expression (and much more!)
motifADE is free software distributed under the GNU General Public License, version 3.
The original motifADE source code from Mootha et al. PNAS 2004 is available upon request.
Contact: Dan Arlow < arlow [at] broad [dot] mit [dot] edu >
Downloads
motifADE source code [motifADE_src.tar.gz]
UCSC genome browser TSS +/- 1kb sequences for all mm8 and hg18 RefSeq transcripts [tx_1000_1000.tar.gz]
UCSC genome browser 3' UTR sequences for all mm8 and hg18 RefSeq transcripts [3utr.tar.gz]
Annotation files for mapping Affymetrix probe set IDs to RefSeq transcript IDs [annotation_files.tar.gz]
Orthology files for mapping between orthologous mouse and human RefSeq Transcript IDs [orthology_files.tar.gz]
Example differential expression data [example_data.tar.gz]
Installing
Compiling the sources is easy on most Unix/Linux distributions; simply navigate to the directory with the source code and type "make", and a binary called "motifADE" should appear in the directory. It's up to you to move it to a more appropriate place.
README
motifADE 2007
A tool for discovering motifs associated with differential expression (and much more!)

Copyright (C) 2003-2008 Dan Arlow
Vamsi Mootha Laboratory
Broad Institute of MIT and Harvard
Department of Systems Biology, Harvard Medical School

********************************************************************************

motifADE is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

motifADE is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with motifADE. If not, see <http://www.gnu.org/licenses/>.

********************************************************************************

This software package implements the motifADE algorithm described in "Erra and Gabpa/b specify PGC-1a-dependent oxidative phosphorylation gene expression that is altered in diabetic muscle", published online on April 20, 2004 (doi: 10.1073/pnas.0401401101), and extensions thereto.

The typical way to run motifADE is to provide differential gene expression data and upstream sequences from all genes; motifADE then uses a rank-sum test to report putative transcription factor binding sites that are significantly associated with induction and repression of genes. The "typical run" methodology is similar to Reduce by Bussemaker et al. and PREGO by Tanay, except that motifADE can use sequence conservation of motifs in orthologous regions to improve specificity, and motifADE can appropriately handle input sequences of varying lengths. A common variation is to provide 3' UTR sequences to identify putative microRNA target sites.
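As an illustration of the rank-sum idea mentioned above, here is a minimal Python sketch. It is not taken from the motifADE source: the function name rank_sum_z, the use of the normal approximation, and the two-sided P-value are assumptions made for this example. It takes sequence IDs already sorted by differential expression and the set of IDs whose sequence contains a motif.

```python
import math

def rank_sum_z(ordered_ids, motif_ids):
    """Normal-approximation Mann-Whitney (rank-sum) test.

    ordered_ids: sequence IDs sorted by differential expression (no ties).
    motif_ids:   set of IDs whose sequence contains the motif.
    Returns (z, two_sided_p). Hypothetical helper, not motifADE's own code.
    """
    n = len(ordered_ids)
    # Ranks (1-based) of the sequences that contain the motif.
    ranks = [i + 1 for i, sid in enumerate(ordered_ids) if sid in motif_ids]
    n1 = len(ranks)
    n2 = n - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must be non-empty")
    w = sum(ranks)                       # rank-sum statistic
    mean = n1 * (n + 1) / 2.0            # expected rank sum under the null
    var = n1 * n2 * (n + 1) / 12.0       # variance under the null (no ties)
    z = (w - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return z, p

# Toy data: 20 made-up IDs, motif present in the 5 top-ranked sequences.
ordered = ["g%d" % i for i in range(1, 21)]
z, p = rank_sum_z(ordered, {"g1", "g2", "g3", "g4", "g5"})
print(z, p)
```

With this convention a negative z means the motif-containing sequences sit near the top of the ordered list; whether that corresponds to induction or repression depends on how the list was sorted.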
motifADE also has tools for testing whether a priori defined motifs are significantly associated with differential expression in a data set, tools for testing whether motifs are over-represented in a given set of genes, and tools for extracting the flanking sequence around motif instances.

SYNOPSIS:

usage: motifADE [-u num_samples | -d | -t | -s | -n | -f gene_set | -h gene_set]
                [-a adj_sig_level | -p nominal_sig_level] [-b]
                [-r neighborhood_radius]
                [-k kmers | -g gaps | -m iupac | -w pwms | -S incidence]
                [-i | -I | -L bp_up,bp_down]
                [-O order_file | -E expression_file] [-A annotation_file]
                [-M] [-P] [-v]
                [-C orthology_file -o FASTA_ortholog_file]
                FASTA_promoter_file

DESCRIPTION:

motifADE searches for motifs associated with differential gene expression using the algorithm described in "Erra and Gabpa/b specify PGC-1a-dependent oxidative phosphorylation gene expression that is altered in diabetic muscle", published 4/20/2004 in Proc. Natl. Acad. Sci. USA, with several extensions.

To run motifADE in the way described in Mootha et al., you must provide a FASTA file of upstream sequences, and expression data from the genes downstream of those sequences. Each sequence in the FASTA file must contain a unique alphanumeric sequence ID in its header (typically a RefSeq Transcript ID).

To provide the expression data, you have four options:

    1. a text file of rank-ordered sequence IDs, one per line
    2. a text file in which each line contains a sequence ID followed by a tab and a numerical measure of differential expression
    3. a text file of rank-ordered Affymetrix probe set IDs, one per line
    4. a text file in which each line contains an Affymetrix probe set ID followed by a tab and a numerical measure of differential expression

Note that for 3 and 4, you must additionally provide an Affymetrix CSV annotation file for associating the probe set IDs with the RefSeq Transcript IDs in the FASTA file.
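To make the tab-delimited expression format (option 2) concrete, the following Python sketch parses such lines and derives a rank order from them. It is an illustration only: the function names, the made-up IDs, the choice to sort from highest to lowest value, and the choice to drop IDs that appear more than once are all assumptions of this example, not documented motifADE behavior.

```python
from collections import defaultdict

def read_expression(lines):
    """Parse 'ID<tab>value' lines into {ID: value}.

    Assumption of this sketch: IDs with more than one entry are dropped.
    """
    values = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        seq_id, value = line.split("\t")
        values[seq_id].append(float(value))
    return {sid: vals[0] for sid, vals in values.items() if len(vals) == 1}

def rank_order(expression):
    """Return IDs sorted from highest to lowest value (assumed direction)."""
    return sorted(expression, key=lambda sid: expression[sid], reverse=True)

# Hypothetical input; NM_0002 appears twice and is therefore dropped.
example = ["NM_0001\t1.5\n", "NM_0002\t-0.7\n", "NM_0003\t0.2\n", "NM_0002\t0.9\n"]
expr = read_expression(example)
print(rank_order(expr))
```

The list returned by rank_order has the same shape as an option 1 file: one ID per line, in rank order.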
In addition to finding motifs associated with differential expression, motifADE can test for enrichment of motifs within an a priori-defined gene set, like most other motif analysis tools.

motifADE can use information about conservation between species to enhance the specificity of its motif-finding algorithm. To search for conserved motifs, you must provide an orthology table from JAX labs and a FASTA file of orthologous upstream sequences.

motifADE can test five different types of motifs:

    1. all "k-mer" motifs, i.e. all sequences over [ACGT] with k letters
    2. all "gap-k-mer" motifs of a particular format, i.e. all ways of replacing the x's in a string such as xxx--xxx with letters from [ACGT], where the dashes represent "don't cares"
    3. explicitly specified motifs over the IUPAC degeneracy alphabet, possibly including several patterns separated by the alternation symbol "|"
    4. position-specific scoring matrix ("weight matrix") motifs, given in a format explained below
    5. pre-computed motif target sets; sets of genes can be tested for association with differential expression as if they contained some motif (a bit like Gene Set Enrichment Analysis)

OPTIONS:

Statistics Options

    -u [num_samples]
        Use the Mann-Whitney test
        The default behavior is to use the standard Mann-Whitney test, but in cases when the sequences are of unequal lengths, it is more appropriate to assume a null distribution in which the probability that a sequence is chosen is proportional to its length. This option estimates the parameters of a normal null distribution for the rank-sum statistic using num_samples Monte Carlo samples.

    -d
        Use the Kolmogorov-Smirnov test

    -t
        Use Student's t-test

    -s
        Use Student's t-test for stepwise regression
        This option is not recommended. The motifADE implementation of stepwise regression simply scans motifs in arbitrary order using the t-test and, when a motif passes the significance threshold, subtracts the effect (i.e.
        difference of means) from the differential expression data, and continues along. A better implementation would test all motifs and then accept the most significant motif, and then repeat the process, but this would be more computationally intensive.

    -f gene_set
        Test the enrichment of motifs in the set of genes listed in the file gene_set (one per line), using the normal approximation to the binomial cumulative distribution

    -h gene_set
        Test the enrichment of motifs in the set of genes listed in the file gene_set using the hypergeometric cumulative distribution

    -p nominal_sig_level
        Only report statistics for motifs whose nominal P-values are smaller than nominal_sig_level

    -a adj_sig_level
        Only report statistics for motifs whose nominal P-values are smaller than adj_sig_level divided by the total number of motifs tested in this run (e.g. for typical k-mer scanning, the threshold is set to adj_sig_level / 4^k for each k)

    -n
        Do nothing; useful with reporting options like -M when you just want to know the mapping results

Conservation Options

    -C orthology_file
        JAX labs orthology table

    -o FASTA_ortholog_file
        FASTA file of orthologous upstream sequences

Expression Data Options

    -O order_file
        Use the rank-ordered sequence IDs or Affymetrix probe set IDs in order_file (one per line) representing differential expression data

    -E expression_file
        Use the sequence IDs or Affymetrix probe set IDs with numerical measures of differential expression in expression_file (tab-delimited) for differential expression data

    -A annotation_file
        Use the Affymetrix CSV annotation file (or a custom CSV or TSV file with the columns "Probe Set ID" and "RefSeq Transcript ID") to map Affymetrix probe set IDs to RefSeq Transcript IDs in the FASTA file

    -M
        Print information about the number of sequences in the FASTA file, the number of entries in the expression file, and the number of sequences that were ultimately associated with exactly one expression entry

Motif Discovery Options

    -k kmer_sizes
        Scan
        all k-mer motifs of the given sizes; kmer_sizes is a comma-delimited list of integers, e.g. 6,7,8,9

    -g gap_formats
        Scan all "gap-k-mer" motifs in each of the given formats; gap_formats is a comma-delimited list of gap formats, e.g. xxx-xxx,xxx--xxx,xx-xx-xx

    -m motif_file
        Scan IUPAC alphabet motifs specified in motif_file

    -w pssm_file
        Scan PSSM motifs from pssm_file given in the format described below

    -S incidence_file
        Scan pre-computed target sets from incidence_file in the "incidence list" format described below

    -b
        "Bidirectional" scanning: allow instances of the motifs to occur on either strand of DNA; cannot be used with -S

    -r neighborhood_radius
        Approximate motif search; match motifs with up to neighborhood_radius mismatches from the motifs being scanned; cannot be used with -w or -S

Reporting Options

    -M
        Print information about the number of sequences in the FASTA file, the number of entries in the expression file, and the number of sequences that were ultimately associated with exactly one expression entry

    -P
        Print the sequences that were ultimately used, with annotation in their headers containing the expression data that was ultimately used

    -v
        Verbose mode; print extra information about some tasks as they are performed

Incidence Reporting Options

    -i
        Report the incidence of the motifs in "incidence list" format (described below)

    -I
        Report the incidence of the motifs in "incidence matrix" format; i.e. a matrix with one row for each gene and one column for each motif, with a 1 in the cell (i,j) if motif j occurs in gene i, otherwise 0

    -L bp_up,bp_down
        Report information about each instance of each identified motif, including the ID of the sequence in which it was found and the associated differential expression value (if applicable), and also the sequence of the instance and bp_up and bp_down of flanking sequence surrounding it.
        NOTE: both bp_up and bp_down are measured from the START OF THE INSTANCE, i.e.
        if you are searching for TGACCTTNA and you want 3 bp upstream and downstream, then you have to use -L 3,12 because the motif is 9 bases long

FILE FORMATS

FASTA sequence file
    Standard FASTA format for DNA sequence with a required header convention; for each sequence, ">" followed by a unique alphanumeric sequence ID string and possibly a space followed by any additional information, which motifADE will ignore. Subsequently, DNA sequence over the alphabet [ACGTN], spanning any number of lines. Sequences on separate lines will be concatenated.

    example:

    >seq_001
    CATGACTGCATGCAGAGGTCATACTGTCGATGCATGATGACCTCTG
    >seq_002
    ACTCTTCCGGTACTGACCTGTACGATTACGTACGTCAGATCGACGTACTGAC
    >SEQ_003
    CATGACGTTACGTAGGTCAACGATCTGTTGACCTACG

IUPAC motif file
    Tab-delimited file of the format NAME [tab] PATTERN [newline], where NAME is a user-specified name for the motif, and PATTERN is a set of sequences over the IUPAC degeneracy alphabet separated by "|"s. For the IUPAC code, see http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html

    example:

    HS1	AGGTCA
    HS2	TGACCT
    NRF1	GCGCAYGCGC|GCGCRTGCGC
    NRF2	CTTCCG|CGGAAG

PSSM motif file
    File format for representing position-specific scoring matrix ("weight matrix") motifs.
    For each motif, ">" followed by a user-specified name for the motif, then a tab, then the threshold log-score for the motif, then a newline, the string "A [tab] C [tab] G [tab] T", then a newline, then counts or frequencies for each position of the matrix, one position per line. In the example below, each row also ends with an IUPAC consensus letter for that position.

    example:

    >V$E2F_02	-1.18562366565774
    A	C	G	T
    0	0	0	12	T
    0	0	0	12	T
    0	0	0	12	T
    0	6	6	0	S
    0	2	10	0	G
    0	12	0	0	C
    0	0	12	0	G
    0	11	1	0	C
    >V$NFKAPPAB50_01	-1.40375929381938
    A	C	G	T
    0	0	18	0	G
    0	0	18	0	G
    0	0	18	0	G
    2	0	16	0	G
    16	1	0	1	A
    0	0	3	15	T
    0	7	1	10	Y
    0	16	0	2	C
    0	18	0	0	C
    0	17	1	0	C

Incidence list file
    First a header line of the format "Motif [tab] Incidence", then for each motif, NAME [tab] INCIDENCE, where NAME is the motif's user-specified name, and INCIDENCE is a comma-delimited list of sequence IDs for sequences that contain the motif, or expression IDs if -A was used.

    example:

    Motif	Incidence
    HS1	seq_001,seq_003
    HS2	seq_001,seq_002,seq_003
    NRF2	seq_002

SAMPLE OUTPUT

./motifADE -k 7 -a .05 \
    -A annotation_files/MG_U74Av2_annot.maximum_matching.tsv \
    -E example_expression_data/PGC_day3_snr \
    -C orthology_files/JAX_mouse_to_human.tsv \
    -o sequence_databases/tx/ucsc.refGene_hg18_tx_1000_1000 \
    sequence_databases/tx/ucsc.refGene_mm8_tx_1000_1000

Motif    Frequency  Delta-Median  Z-score  P-value      Adjusted P-value
AAGGTCA  0.0498851  0.568521      5.7695   7.9508e-09   0.000130266
CTTCCGG  0.0809195  0.326803      4.73282  2.21426e-06  0.0362784
GACCTTG  0.0445977  0.528876      5.17365  2.29565e-07  0.00376119
CGGAAGT  0.0696552  0.377269      4.73151  2.2286e-06   0.0365133
TGACCTT  0.0471264  0.622672      6.87246  6.3104e-12   1.0339e-07
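The "Adjusted P-value" column in the sample output above appears to be the nominal P-value multiplied by the number of 7-mers tested (4^7 = 16384), which matches the -a threshold of adj_sig_level / 4^k. The short Python sketch below recomputes the column from the sample rows. The function name adjusted_p and the cap at 1 are assumptions of this example; that motifADE computes the column in exactly this way is an inference from the numbers, not something the README states.

```python
import math

def adjusted_p(nominal_p, k):
    """Multiply the nominal P-value by the 4**k k-mers tested, capped at 1."""
    return min(1.0, nominal_p * 4 ** k)

# (motif, nominal P-value, reported adjusted P-value) from the sample output.
rows = [
    ("AAGGTCA", 7.9508e-09, 0.000130266),
    ("CTTCCGG", 2.21426e-06, 0.0362784),
    ("GACCTTG", 2.29565e-07, 0.00376119),
    ("CGGAAGT", 2.2286e-06, 0.0365133),
    ("TGACCTT", 6.3104e-12, 1.0339e-07),
]

for motif, nominal, reported in rows:
    recomputed = adjusted_p(nominal, len(motif))
    agrees = math.isclose(recomputed, reported, rel_tol=1e-4)
    print(motif, recomputed, reported, agrees)
```

All five motifs have an adjusted value below the .05 passed to -a, which is why they are the ones reported.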