
motifADE
A tool for identifying motifs associated with differential expression (and much more!)
motifADE is free software distributed under the GNU General Public License, version 3.
The original motifADE source code from Mootha et al. PNAS 2004 is available upon request.
Contact: Dan Arlow < arlow [at] broad [dot] mit [dot] edu >
Downloads
motifADE source code [motifADE_src.tar.gz]
UCSC genome browser TSS +/- 1kb sequences for all mm8 and hg18 RefSeq transcripts [tx_1000_1000.tar.gz]
UCSC genome browser 3' UTR sequences for all mm8 and hg18 RefSeq transcripts [3utr.tar.gz]
Annotation files for mapping Affymetrix probe set IDs to RefSeq transcript IDs [annotation_files.tar.gz]
Orthology files for mapping between orthologous mouse and human RefSeq Transcript IDs [orthology_files.tar.gz]
Example differential expression data [example_data.tar.gz]
Installing
Compiling the sources is easy on most Unix/Linux distributions: navigate to the directory containing the source code and type "make". A binary called "motifADE" should appear in that directory. It's up to you to move it to a more appropriate place.
README
motifADE 2007
A tool for discovering motifs associated with differential expression
(and much more!)
Copyright (C) 2003-2008 Dan Arlow
Vamsi Mootha Laboratory
Broad Institute of MIT and Harvard
Department of Systems Biology, Harvard Medical School
********************************************************************************
motifADE is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
motifADE is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with motifADE. If not, see <http://www.gnu.org/licenses/>.
********************************************************************************
This software package implements the motifADE algorithm described in "Erra and
Gabpa/b specify PGC-1a-dependent oxidative phosphorylation gene expression that
is altered in diabetic muscle", published online on April 20, 2004 (doi:
10.1073/pnas.0401401101), and extensions thereto.
The typical way to run motifADE is to provide differential gene expression data
and upstream sequences from all genes, and motifADE produces putative
transcription factor binding sites that are significantly associated with
induction and repression of genes using a rank-sum test. The "typical run"
methodology is similar to Reduce by Bussemaker et al. and PREGO by Tanay, except
that motifADE can use sequence conservation of motifs in orthologous regions to
improve specificity, and motifADE can appropriately handle input sequences of
varying lengths. A common variation is to provide 3' UTR sequences to identify
putative microRNA target sites. motifADE also has tools for testing if a priori
defined motifs are significantly associated with differential expression in a
data set, tools for testing if motifs are over-represented in a given set of
genes, and tools for extracting the flanking sequence around motif instances.
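The rank-sum idea above can be illustrated with a minimal Python sketch. It is
not taken from the motifADE sources: the function names are hypothetical, ties
are ignored, and the sign convention (rank 1 is the first ID in the ordering)
is an assumption. It computes the normal-approximation Z-score of the rank sum
of the genes that contain a motif.

```python
import math

def rank_sum_z(ranked_ids, motif_ids):
    """Z-score of the rank sum of motif-containing genes.

    ranked_ids: sequence IDs ordered by differential expression (rank 1 first).
    motif_ids:  set of IDs whose sequence contains the motif.
    Uses the standard normal approximation without a tie correction.
    """
    n_total = len(ranked_ids)
    ranks = [i + 1 for i, sid in enumerate(ranked_ids) if sid in motif_ids]
    m = len(ranks)          # genes with the motif
    n = n_total - m         # genes without the motif
    if m == 0 or n == 0:
        raise ValueError("need at least one gene with and one without the motif")
    rank_sum = sum(ranks)
    mean = m * (n_total + 1) / 2.0
    variance = m * n * (n_total + 1) / 12.0
    return (rank_sum - mean) / math.sqrt(variance)

def two_sided_p(z):
    """Two-sided P-value for a standard normal Z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))
```

With this convention, a motif concentrated at the top of the ordering gives a
negative Z-score and one concentrated at the bottom gives a positive Z-score
of the same magnitude.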
SYNOPSIS:
usage: motifADE [-u num_samples | -d | -t | -s | -n | -f gene_set | -h gene_set]
[-a adj_sig_level | -p nominal_sig_level]
[-b] [-r neighborhood_radius]
[-k kmers | -g gaps | -m iupac | -w pwms | -S incidence]
[-i | -I | -L bp_up,bp_down]
[-O order_file | -E expression_file] [-A annotation_file] [-M]
[-P] [-v]
[-C orthology_file -o FASTA_ortholog_file] FASTA_promoter_file
DESCRIPTION:
motifADE searches for motifs associated with differential gene expression
using the algorithm described in "Erra and Gabpa/b specify PGC-1a-dependent
oxidative phosphorylation gene expression that is altered in diabetic
muscle", published online on April 20, 2004, in Proc. Natl. Acad. Sci. USA,
with several extensions.
To run motifADE in the way described in Mootha et al., you must provide a
FASTA file of upstream sequences, and expression data from the genes
downstream of those sequences. Each sequence in the FASTA file must contain
a unique alphanumeric sequence ID in its header (typically a RefSeq
Transcript ID). To provide the expression data, you have four options:
1. a text file of ordered sequence IDs, one per line
2. a text file in which each line contains a sequence ID followed by a tab
and a numerical measure of differential expression
3. a text file of rank-ordered Affymetrix probe set IDs, one per line
4. a text file in which each line contains an Affymetrix probe set ID
followed by a tab and a numerical measure of differential expression
Note that for 3 and 4, you must additionally provide an Affymetrix CSV
annotation file for associating the probe set IDs with the RefSeq Transcript
IDs in the FASTA file.
In addition to finding motifs associated with differential expression,
motifADE can test for enrichment of motifs within an a priori-defined gene
set, like most other motif analysis tools.
motifADE can use information about conservation between species to enhance
the specificity of its motif-finding algorithm. To search for conserved
motifs, you must provide an orthology table from JAX labs and a FASTA file
of orthologous upstream sequences.
motifADE can test five different types of motifs:
1. all "k-mer" motifs, i.e. all sequences over [ACGT] with k letters
2. all "gap-k-mer" motifs of a particular format, i.e. all ways of replacing
the x's in the string xxx--xxx with letters from [ACGT], and the dashes
represent "don't cares"
3. explicitly specified motifs over the IUPAC degeneracy alphabet, possibly
including several patterns separated by the alternation symbol "|"
4. position-specific scoring matrix ("weight matrix") motifs, given in a
format explained below
5. pre-computed motif target sets; sets of genes can be tested for
association with differential expression as if they contained some motif
(a bit like Gene Set Enrichment Analysis.)
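To make motif types 1 and 2 concrete, here is a small Python sketch that
enumerates every motif for a given gap format. The function name is
hypothetical and this is not how motifADE itself is implemented; it only shows
what "all ways of replacing the x's" means. A format with no dashes (e.g. xxx)
yields the plain k-mers.

```python
from itertools import product

def expand_gap_format(fmt, alphabet="ACGT"):
    """Yield every motif obtained by filling the x's of fmt with letters.

    Dashes are kept in place as "don't care" positions.
    """
    if any(ch not in "x-" for ch in fmt):
        raise ValueError("format may contain only 'x' and '-': %r" % fmt)
    slots = [i for i, ch in enumerate(fmt) if ch == "x"]
    for letters in product(alphabet, repeat=len(slots)):
        motif = list(fmt)
        for pos, letter in zip(slots, letters):
            motif[pos] = letter
        yield "".join(motif)
```

A format with n x's expands to 4^n motifs, so xxx--xxx gives 4096 motifs.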
OPTIONS:
Statistics Options
-u [num_samples]
Use the Mann-Whitney test
The default behavior is to use the standard Mann-Whitney test, but in
cases when the sequences are of unequal lengths, it is more appropriate
to assume a null distribution in which the probability that a sequence
is chosen is proportional to its length. This option estimates the
parameters of a normal null distribution for the rank-sum statistic
using num_samples Monte Carlo samples.
-d
Use the Kolmogorov-Smirnov test
-t
Use Student's t-test
-s
Use Student's t-test for stepwise regression
This option is not recommended. The motifADE implementation of stepwise
regression simply scans motifs in arbitrary order using the t-test and
when a motif passes the significance threshold, subtracts the effect
(i.e. difference of means) from the differential expression data, and
continues along. A better implementation would test all motifs and then
accept the most significant motif, and then repeat the process, but this
would be more computationally intensive.
-f gene_set
Test the enrichment of motifs in the set of genes listed in the file
gene_set (one per line), using the normal approximation to the
binomial cumulative distribution
-h gene_set
Test the enrichment of motifs in the set of genes listed in the file
gene_set using the hypergeometric cumulative distribution
-p nominal_sig_level
Only report statistics for motifs whose nominal P-values are smaller
than nominal_sig_level
-a adj_sig_level
Only report statistics for motifs whose nominal P-values are smaller
than adj_sig_level divided by the total number of motifs tested in this
run. (e.g. For typical k-mer scanning, the threshold is set to
adj_sig_level / 4^k for each k)
-n
Do nothing; useful with reporting options like -M when you just want to
know the mapping results
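The length-weighted null distribution described under -u can be sketched in
Python as follows. This is an illustration under stated assumptions, not the
motifADE implementation: it assumes that motif-containing genes are drawn
without replacement, each draw choosing a sequence with probability
proportional to its length, and the function names and the seed parameter are
hypothetical.

```python
import math
import random

def weighted_sample_without_replacement(weights, m, rng):
    """Draw m distinct indices; each draw picks among the indices not yet
    drawn with probability proportional to their weight."""
    if m > len(weights):
        raise ValueError("cannot draw more items than are available")
    remaining = list(range(len(weights)))
    w = [float(x) for x in weights]
    chosen = []
    for _ in range(m):
        pick = rng.choices(range(len(remaining)), weights=w, k=1)[0]
        chosen.append(remaining.pop(pick))
        w.pop(pick)
    return chosen

def length_weighted_null(lengths_by_rank, m, num_samples, seed=0):
    """Estimate mean and standard deviation of the rank sum of m genes.

    lengths_by_rank[i] is the length of the sequence with rank i + 1.
    """
    rng = random.Random(seed)
    sums = []
    for _ in range(num_samples):
        idx = weighted_sample_without_replacement(lengths_by_rank, m, rng)
        sums.append(sum(i + 1 for i in idx))
    mean = sum(sums) / num_samples
    var = sum((s - mean) ** 2 for s in sums) / (num_samples - 1)
    return mean, math.sqrt(var)
```

When all sequences have the same length, the estimates should approach the
standard Mann-Whitney values m(N+1)/2 and sqrt(m(N-m)(N+1)/12). When longer
sequences cluster at one end of the ranking, the estimated mean shifts toward
that end, which is the bias this option is meant to account for.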
Conservation Options
-C orthology_file
JAX labs orthology table
-o FASTA_ortholog_file
FASTA file of orthologous upstream sequences
Expression Data Options
-O order_file
Use the rank-ordered sequence IDs or Affymetrix probe set IDs in
order_file (one per line) representing differential expression data
-E expression_file
Use the sequence IDs or Affymetrix probe set IDs with numerical
measures of differential expression in expression_file (tab-delimited)
for differential expression data
-A annotation_file
Use the Affymetrix CSV annotation file (or a custom CSV or TSV file
with the columns "Probe Set ID" and "RefSeq Transcript ID") to map
Affymetrix probe set IDs to RefSeq Transcript IDs in the FASTA file
-M
Print information about the number of sequences in the FASTA file, the
number of entries in the expression file, and the number of sequences
that were ultimately associated with exactly one expression entry
Motif Discovery Options
-k kmer_sizes
Scan all k-mer motifs of the given sizes; kmer_sizes is a comma-
delimited list of integers e.g. 6,7,8,9
-g gap_formats
Scan all "gap-k-mer" motifs in each of the given formats; gap_formats
is a comma-delimited list of gap formats e.g. xxx-xxx,xxx--xxx,xx-xx-xx
-m motif_file
Scan IUPAC alphabet motifs specified in motif_file
-w pssm_file
Scan PSSM motifs from pssm_file given in the format described below
-S incidence_file
Scan pre-computed target sets from incidence_file in the "incidence
list" format described below
-b
"Bidirectional" scanning -- allow instances of the motifs to occur on
either strand of DNA; cannot be used with -S
-r neighborhood_radius
Approximate motif search; match motifs with up to neighborhood_radius
mismatches from the motifs being scanned; cannot be used with -w or -S
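The effect of -b and -r on matching can be shown with a short Python sketch.
It handles plain [ACGT] motifs only, the function names are hypothetical, and
how motifADE itself counts overlapping or palindromic instances is not
specified here, so treat the details as assumptions.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse complement of a DNA string over [ACGTN]."""
    return seq.translate(_COMPLEMENT)[::-1]

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(1 for x, y in zip(a, b) if x != y)

def find_instances(seq, motif, radius=0, bidirectional=False):
    """Return (start, strand) pairs where motif matches seq with at most
    `radius` mismatches. With bidirectional=True the reverse complement of
    the motif is also tried and reported on the "-" strand."""
    targets = [("+", motif)]
    if bidirectional:
        rc = reverse_complement(motif)
        if rc != motif:  # skip palindromes to avoid duplicate hits
            targets.append(("-", rc))
    k = len(motif)
    hits = []
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        for strand, target in targets:
            if hamming(window, target) <= radius:
                hits.append((start, strand))
    return hits
```

For example, TGACCT is not found on the forward strand of AAAGGTCAAA, but
with bidirectional=True its reverse complement AGGTCA matches at offset 2.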
Reporting Options
-M
Print information about the number of sequences in the FASTA file, the
number of entries in the expression file, and the number of sequences
that were ultimately associated with exactly one expression entry
-P
Print the sequences that were ultimately used, with annotation in their
headers containing the expression data that was ultimately used
-v
Verbose mode; print extra information about some tasks as they are
performed
Incidence Reporting Options
-i
Report the incidence of the motifs in "incidence list" format
(described below)
-I
Report the incidence of the motifs in "incidence matrix" format; i.e.
a matrix with one row for each gene and one column for each motif, with
a 1 in the cell (i,j) if motif j occurs in gene i, otherwise 0
-L bp_up,bp_down
Report information about each instance of each identified motif,
including the ID of the sequence in which it was found and the associated
differential expression value (if applicable), as well as the sequence
of the instance and bp_up and bp_down of flanking sequence surrounding
it. NOTE: both bp_up and bp_down are measured from the START OF THE
INSTANCE, i.e. if you are searching for TGACCTTNA and you want 3 bp
upstream and downstream, then you have to use -L 3,12 because the motif
is 9 bases long
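The arithmetic for -L is easy to get wrong, so here is a small Python sketch
of it. The helper names are hypothetical, and clipping at the ends of the
sequence is an assumption about boundary behavior, not something this README
specifies.

```python
def flank_args(motif_len, flank):
    """Return (bp_up, bp_down) giving `flank` bases on each side of a motif.

    bp_down is measured from the start of the instance, so it must include
    the motif length itself.
    """
    return flank, motif_len + flank

def extract_instance(seq, start, bp_up, bp_down):
    """Slice covering bp_up bases before the instance start and bp_down
    bases from the instance start onward, clipped to the sequence."""
    left = max(0, start - bp_up)
    right = min(len(seq), start + bp_down)
    return seq[left:right]
```

For the 9-base motif TGACCTTNA with 3 bp of flank on each side,
flank_args(9, 3) returns (3, 12), matching the -L 3,12 example above.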
FILE FORMATS
FASTA sequence file
Standard FASTA format for DNA sequence with a required header convention;
for each sequence, ">" followed by a unique alphanumeric sequence ID string
and possibly a space followed by any additional information which motifADE
will ignore. Subsequently, DNA sequence over the alphabet [ACGTN], spanning
any number of lines. Sequences on separate lines will be concatenated.
example:
>seq_001
CATGACTGCATGCAGAGGTCATACTGTCGATGCATGATGACCTCTG
>seq_002
ACTCTTCCGGTACTGACCTGTACGATTACGTACGTCAGATCGACGTACTGAC
>SEQ_003
CATGACGTTACGTAGGTCAACGATCTGTTGACCTACG
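For preparing or checking input files, the header convention above can be
expressed as a short Python reader. This is an illustrative sketch, not
motifADE's own parser; the function name is hypothetical, and upper-casing
the sequence and rejecting duplicate IDs are assumptions.

```python
def read_fasta(lines):
    """Parse FASTA lines into a dict {sequence_id: sequence}.

    The ID is the text between ">" and the first whitespace; anything after
    it on the header line is ignored. Sequence lines are concatenated.
    """
    parts = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split(None, 1)
            if not fields:
                raise ValueError("header without a sequence ID")
            current = fields[0]
            if current in parts:
                raise ValueError("duplicate sequence ID: " + current)
            parts[current] = []
        elif current is None:
            raise ValueError("sequence data before the first header")
        else:
            parts[current].append(line.upper())
    return {sid: "".join(chunks) for sid, chunks in parts.items()}
```

Because the function accepts any iterable of lines, it can be given an open
file object directly, e.g. read_fasta(open(path)).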
IUPAC motif file
Tab-delimited file of the format NAME [tab] PATTERN [newline], where NAME
is a user-specified name for the motif, and PATTERN is a set of sequences
over the IUPAC degeneracy alphabet separated by "|"s. For the IUPAC code,
see http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html
example:
HS1 AGGTCA
HS2 TGACCT
NRF1 GCGCAYGCGC|GCGCRTGCGC
NRF2 CTTCCG|CGGAAG
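One way to see what an IUPAC pattern matches is to translate it into a regular
expression. The Python sketch below does that for the format above. The
function names are hypothetical and motifADE does not necessarily use regular
expressions internally; the code table follows the standard IUPAC nucleotide
codes.

```python
import re

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def iupac_to_regex(pattern):
    """Compile an IUPAC pattern, with alternatives separated by "|"."""
    alternatives = []
    for alt in pattern.split("|"):
        alternatives.append("".join(IUPAC[ch] for ch in alt.upper()))
    return re.compile("|".join(alternatives))

def read_motif_file(lines):
    """Parse NAME [tab] PATTERN lines into {name: compiled regex}."""
    motifs = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        name, pattern = line.split("\t")
        motifs[name] = iupac_to_regex(pattern)
    return motifs
```

For the NRF1 entry above, the compiled pattern accepts GCGCACGCGC,
GCGCATGCGC and GCGCGTGCGC but not GCGCAAGCGC.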
PSSM motif file
File format for representing position-specific scoring matrix ("weight
matrix") motifs. For each motif, ">" followed by a user-specified name for
the motif, then a tab, then the threshold log-score for the motif, then a
newline, the string "A [tab] C [tab] G [tab] T", then a newline, then
counts or frequencies for each position of the matrix separated by newlines
example:
>V$E2F_02 -1.18562366565774
A C G T
0 0 0 12 T
0 0 0 12 T
0 0 0 12 T
0 6 6 0 S
0 2 10 0 G
0 12 0 0 C
0 0 12 0 G
0 11 1 0 C
>V$NFKAPPAB50_01 -1.40375929381938
A C G T
0 0 18 0 G
0 0 18 0 G
0 0 18 0 G
2 0 16 0 G
16 1 0 1 A
0 0 3 15 T
0 7 1 10 Y
0 16 0 2 C
0 18 0 0 C
0 17 1 0 C
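The PSSM layout can be read with a few lines of Python, which may help when
converting matrices from other sources. This sketch is not motifADE's parser.
It assumes motif names contain no whitespace, splits on any whitespace rather
than strictly on tabs, and ignores the trailing consensus letter shown in the
example rows; the function name is hypothetical.

```python
def read_pssm(lines):
    """Parse PSSM blocks into a list of dicts with keys
    "name", "threshold" and "rows" (one [A, C, G, T] list per position)."""
    motifs = []
    current = None
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue
        if fields[0].startswith(">"):
            current = {
                "name": fields[0][1:],
                "threshold": float(fields[1]),
                "rows": [],
            }
            motifs.append(current)
        elif fields[:4] == ["A", "C", "G", "T"]:
            continue  # column header line
        else:
            if current is None:
                raise ValueError("matrix row before the first motif header")
            current["rows"].append([float(x) for x in fields[:4]])
    return motifs
```

Each parsed motif keeps its rows in file order, so len(motif["rows"]) is the
motif width.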
Incidence list file
First a header line of the format "Motif [tab] Incidence", then for each
motif, NAME [tab] INCIDENCE, where NAME is the motif's user-specified name,
and INCIDENCE is a comma-delimited list of sequence IDs for sequences that
contain the motif, or expression IDs if -A was used.
example:
Motif Incidence
HS1 seq_001,seq_003
HS2 seq_001,seq_002,seq_003
NRF2 seq_002
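The incidence list and the incidence matrix reported by -I carry the same
information. The Python sketch below reads a list in the format above and
builds the corresponding 0/1 matrix. The function names are hypothetical, and
splitting on any whitespace (rather than strictly on tabs) is an assumption.

```python
def read_incidence_list(lines):
    """Parse an incidence list into {motif_name: [sequence IDs]}."""
    it = iter(lines)
    header = next(it).split()
    if header != ["Motif", "Incidence"]:
        raise ValueError("expected header line 'Motif<tab>Incidence'")
    incidence = {}
    for raw in it:
        fields = raw.split()
        if not fields:
            continue
        name = fields[0]
        incidence[name] = fields[1].split(",") if len(fields) > 1 else []
    return incidence

def to_incidence_matrix(incidence, gene_ids):
    """Return (motif_names, rows): one row per gene, one column per motif,
    with 1 where the motif occurs in the gene and 0 otherwise."""
    motif_names = list(incidence)
    members = {name: set(ids) for name, ids in incidence.items()}
    rows = [[1 if gene in members[name] else 0 for name in motif_names]
            for gene in gene_ids]
    return motif_names, rows
```

Applied to the example above with genes seq_001, seq_002, seq_003, the rows
are [1, 1, 0], [0, 1, 1] and [1, 1, 0] for the columns HS1, HS2, NRF2.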
SAMPLE OUTPUT
./motifADE -k 7 -a .05 \
-A annotation_files/MG_U74Av2_annot.maximum_matching.tsv \
-E example_expression_data/PGC_day3_snr \
-C orthology_files/JAX_mouse_to_human.tsv \
-o sequence_databases/tx/ucsc.refGene_hg18_tx_1000_1000 \
sequence_databases/tx/ucsc.refGene_mm8_tx_1000_1000
Motif    Frequency  Delta-Median  Z-score  P-value      Adjusted P-value
AAGGTCA  0.0498851  0.568521      5.7695   7.9508e-09   0.000130266
CTTCCGG  0.0809195  0.326803      4.73282  2.21426e-06  0.0362784
GACCTTG  0.0445977  0.528876      5.17365  2.29565e-07  0.00376119
CGGAAGT  0.0696552  0.377269      4.73151  2.2286e-06   0.0365133
TGACCTT  0.0471264  0.622672      6.87246  6.3104e-12   1.0339e-07
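The adjusted P-values in this output are consistent with the -a rule described
under OPTIONS: with -k 7 there are 4^7 = 16384 motifs, and each adjusted
P-value equals the nominal P-value times 16384. The Python sketch below
reproduces that arithmetic. The function names are hypothetical, and capping
the adjusted value at 1 is an assumption that the rows shown here do not
exercise.

```python
def adjusted_p(p, k):
    """Nominal P-value multiplied by the number of k-mers tested (4^k),
    capped at 1."""
    return min(1.0, p * 4 ** k)

def nominal_threshold(adj_sig_level, k):
    """Per-motif nominal threshold implied by -a for k-mer scanning."""
    return adj_sig_level / 4 ** k

# Values copied from the sample output above (k = 7).
rows = [
    ("AAGGTCA", 7.9508e-09, 0.000130266),
    ("CTTCCGG", 2.21426e-06, 0.0362784),
    ("TGACCTT", 6.3104e-12, 1.0339e-07),
]
for motif, p, reported in rows:
    print(motif, adjusted_p(p, 7), reported)
```

With -a .05 and k = 7, the nominal threshold is 0.05 / 16384, about 3.05e-06,
which is why every motif listed has a nominal P-value below that value.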