motifADE

A tool for identifying motifs associated with differential expression (and much more!)
motifADE is free software distributed under the GNU General Public License, version 3.
The original motifADE source code from Mootha et al. PNAS 2004 is available upon request.
Contact: Dan Arlow < arlow [at] broad [dot] mit [dot] edu >
Downloads
motifADE source code [motifADE_src.tar.gz]
UCSC genome browser TSS +/- 1kb sequences for all mm8 and hg18 RefSeq transcripts [tx_1000_1000.tar.gz]
UCSC genome browser 3’ UTR sequences for all mm8 and hg18 RefSeq transcripts [3utr.tar.gz]
Annotation files for mapping Affymetrix probe set IDs to RefSeq transcript IDs [annotation_files.tar.gz]
Orthology files for mapping between orthologous mouse and human RefSeq Transcript IDs [orthology_files.tar.gz]
Example differential expression data [example_data.tar.gz]
Installing
Compiling the sources is easy on most Unix/Linux distributions: navigate to the directory containing the source code and type “make”. A binary called “motifADE” should appear in that directory; you can then move it to a more appropriate location.
README
motifADE 2007

A tool for discovering motifs associated with differential expression
(and much more!)


Copyright (C) 2003-2008 Dan Arlow

Vamsi Mootha Laboratory
  Broad Institute of MIT and Harvard
  Department of Systems Biology, Harvard Medical School


********************************************************************************
motifADE is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

motifADE is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with motifADE.  If not, see <http://www.gnu.org/licenses/>.
********************************************************************************


This software package implements the motifADE algorithm described in "Erra and
Gabpa/b specify PGC-1a-dependent oxidative phosphorylation gene expression that
is altered in diabetic muscle" (published online April 20, 2004; doi:
10.1073/pnas.0401401101) and extensions thereto.

In a typical run, you provide differential gene expression data and upstream
sequences for all genes, and motifADE uses a rank-sum test to report putative
transcription factor binding sites that are significantly associated with
induction or repression of genes. This "typical run" methodology is similar to
Reduce by Bussemaker et al. and PREGO by Tanay, except
that motifADE can use sequence conservation of motifs in orthologous regions to
improve specificity, and motifADE can appropriately handle input sequences of
varying lengths. A common variation is to provide 3' UTR sequences to identify
putative microRNA target sites. motifADE also has tools for testing if a priori
defined motifs are significantly associated with differential expression in a
data set, tools for testing if motifs are over-represented in a given set of
genes, and tools for extracting the flanking sequence around motif instances.
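
As a rough illustration of the rank-sum idea described above, the following
Python sketch computes a normal-approximation z-score for the ranks of the
genes that contain a motif. It assumes ranks 1..N with no ties. The function
name and interface are hypothetical and are not taken from the motifADE source.

```python
import math

def rank_sum_z(motif_ranks, n_genes):
    """z-score of the rank sum of motif-containing genes (no ties assumed).

    motif_ranks: ranks (1..n_genes) of the genes that contain the motif.
    n_genes:     total number of ranked genes.
    """
    n1 = len(motif_ranks)
    n2 = n_genes - n1
    rank_sum = sum(motif_ranks)
    # Mean and variance of the rank sum under the standard null hypothesis
    # that every gene is equally likely to contain the motif.
    mean = n1 * (n_genes + 1) / 2.0
    variance = n1 * n2 * (n_genes + 1) / 12.0
    return (rank_sum - mean) / math.sqrt(variance)
```

A motif whose instances cluster at one end of the ranking yields a z-score far
from zero; the sign depends on which end of the ranking is numbered 1.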



SYNOPSIS:
    
usage: motifADE [-u num_samples | -d | -t | -s | -n | -f gene_set | -h gene_set]
                [-a adj_sig_level | -p nominal_sig_level]
                [-b] [-r neighborhood_radius]
                [-k kmers | -g gaps | -m iupac | -w pwms | -S incidence]
                [-i | -I | -L bp_up,bp_down]
                [-O order_file | -E expression_file] [-A annotation_file] [-M]
                [-P] [-v]
                [-C orthology_file -o FASTA_ortholog_file] FASTA_promoter_file

DESCRIPTION:

    motifADE searches for motifs associated with differential gene expression
    using the algorithm described in "Erra and Gabpa/b specify PGC-1a-dependent
    oxidative phosphorylation gene expression that is altered in diabetic
    muscle" published 4/20/2004 in Proc. Natl. Acad. Sci. USA, with several
    extensions.
    
    To run motifADE in the way described in Mootha et al., you must provide a
    FASTA file of upstream sequences, and expression data from the genes
    downstream of those sequences. Each sequence in the FASTA file must contain
    a unique alphanumeric sequence ID in its header (typically a RefSeq
    Transcript ID). To provide the expression data, you have four options:
    
    1. a text file of rank-ordered sequence IDs, one per line
    
    2. a text file in which each line contains a sequence ID followed by a tab
       and a numerical measure of differential expression
    
    3. a text file of rank-ordered Affymetrix probe set IDs, one per line
    
    4. a text file in which each line contains an Affymetrix probe set ID
       followed by a tab and a numerical measure of differential expression
    
    Note that for options 3 and 4, you must additionally provide an Affymetrix CSV
    annotation file for associating the probe set IDs with the RefSeq Transcript
    IDs in the FASTA file.
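
    The relationship between the scored formats (options 2 and 4) and the
    rank-ordered formats (options 1 and 3) can be sketched in Python as below.
    This is only an illustration: the function names are hypothetical, and
    sorting from highest to lowest score is an assumption, since the expected
    direction of the ordering is not stated here.

```python
def parse_expression(lines):
    """Parse 'ID<TAB>value' lines into a dict of floats.

    Blank lines are skipped. A repeated ID simply overwrites the earlier
    value in this sketch.
    """
    scores = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        gene_id, value = line.split("\t")[:2]
        scores[gene_id] = float(value)
    return scores

def to_rank_order(scores):
    """IDs sorted from highest to lowest score (direction is an assumption)."""
    return sorted(scores, key=lambda gene_id: scores[gene_id], reverse=True)
```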
    
    In addition to finding motifs associated with differential expression,
    motifADE can test for enrichment of motifs within an a priori-defined gene
    set, like most other motif analysis tools.
    
    motifADE can use information about conservation between species to enhance
    the specificity of its motif-finding algorithm. To search for conserved
    motifs, you must provide an orthology table from JAX labs and a FASTA file
    of orthologous upstream sequences.
    
    motifADE can test five different types of motifs:
    
    1. all "k-mer" motifs, i.e. all sequences over [ACGT] with k letters
    
    2. all "gap-k-mer" motifs of a particular format, i.e. all ways of replacing
       the x's in a format string such as xxx--xxx with letters from [ACGT],
       where the dashes represent "don't care" positions
    
    3. explicitly specified motifs over the IUPAC degeneracy alphabet, possibly
       including several patterns separated by the alternation symbol "|"
    
    4. position-specific scoring matrix ("weight matrix") motifs, given in a
       format explained below
    
    5. pre-computed motif target sets; sets of genes can be tested for
       association with differential expression as if they contained some motif
       (a bit like Gene Set Enrichment Analysis)
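
    To make motif types 1 and 2 concrete, here is a small Python sketch that
    enumerates all k-mers and all gap-k-mers for a given format string. Writing
    the "don't care" positions as N is a choice made for this example, and the
    function names are hypothetical.

```python
from itertools import product

def kmers(k):
    """Yield every sequence of length k over the alphabet ACGT (4**k total)."""
    for letters in product("ACGT", repeat=k):
        yield "".join(letters)

def gap_kmers(gap_format):
    """Yield every motif for a format such as 'xxx--xxx'.

    Each 'x' is replaced by a letter from ACGT; each '-' becomes 'N'.
    """
    n_letters = gap_format.count("x")
    for letters in product("ACGT", repeat=n_letters):
        remaining = iter(letters)
        yield "".join(next(remaining) if ch == "x" else "N"
                      for ch in gap_format)
```

    For the format xxx--xxx this produces 4^6 = 4096 motifs, the same number as
    for plain 6-mers.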
    
    
OPTIONS:
    
    Statistics Options
    
    -u [num_samples]
        Use the Mann-Whitney test
        
        The default behavior is to use the standard Mann-Whitney test, but in
        cases when the sequences are of unequal lengths, it is more appropriate
        to assume a null distribution in which the probability that a sequence
        is chosen is proportional to its length. This option estimates the
        parameters of a normal null distribution for the rank-sum statistic
        using num_samples Monte Carlo samples.
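
        One possible reading of this procedure is sketched below in Python.
        The sampling scheme (weighted sampling without replacement using
        Efraimidis-Spirakis keys), the function name and the seed parameter
        are assumptions made for illustration; the actual scheme used by
        motifADE is not specified here.

```python
import random
import statistics

def length_weighted_null(lengths_by_rank, n_with_motif, num_samples, seed=0):
    """Estimate the mean and s.d. of the rank sum under a length-weighted null.

    lengths_by_rank[i] is the length of the sequence at rank i + 1.
    Each Monte Carlo sample draws n_with_motif sequences without replacement,
    with selection probability proportional to sequence length.
    """
    rng = random.Random(seed)
    rank_sums = []
    for _ in range(num_samples):
        # Key u ** (1 / weight): the n largest keys form a weighted sample
        # without replacement (Efraimidis-Spirakis).
        keyed = [(rng.random() ** (1.0 / length), rank)
                 for rank, length in enumerate(lengths_by_rank, start=1)]
        keyed.sort(reverse=True)
        rank_sums.append(sum(rank for _, rank in keyed[:n_with_motif]))
    return statistics.mean(rank_sums), statistics.stdev(rank_sums)
```

        With equal lengths the estimates approach the standard Mann-Whitney
        mean and standard deviation; with unequal lengths the null shifts
        toward the ranks of the longer sequences.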
    
    -d
        Use the Kolmogorov-Smirnov test
    
    -t
        Use Student's t-test
    
    -s
        Use Student's t-test for stepwise regression
        
        This option is not recommended. The motifADE implementation of stepwise
        regression simply scans motifs in arbitrary order using the t-test and
        when a motif passes the significance threshold, subtracts the effect
        (i.e. difference of means) from the differential expression data, and
        continues along. A better implementation would test all motifs and then
        accept the most significant motif, and then repeat the process, but this
        would be more computationally intensive.
    
    -f gene_set
        Test the enrichment of motifs in the set of genes listed in the file
        gene_set (one per line), using the normal approximation to the
        binomial cumulative distribution
    
    -h gene_set
        Test the enrichment of motifs in the set of genes listed in the file
        gene_set using the hypergeometric cumulative distribution
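
        For reference, a hypergeometric over-representation P-value can be
        computed as in the Python sketch below. It assumes an upper-tail
        test, P(X >= overlap); the function name and argument names are
        hypothetical and do not come from the motifADE source.

```python
from math import comb

def hypergeom_upper_tail(n_total, n_with_motif, set_size, overlap):
    """P(X >= overlap) for a hypergeometric draw.

    n_total:      number of genes in the background
    n_with_motif: background genes that contain the motif
    set_size:     number of genes in the gene set
    overlap:      gene-set genes that contain the motif
    """
    denominator = comb(n_total, set_size)
    upper = min(n_with_motif, set_size)
    numerator = sum(
        comb(n_with_motif, x) * comb(n_total - n_with_motif, set_size - x)
        for x in range(overlap, upper + 1)
    )
    return numerator / denominator
```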
    
    -p nominal_sig_level
        Only report statistics for motifs whose nominal P-values are smaller
        than nominal_sig_level
        
    -a adj_sig_level
        Only report statistics for motifs whose nominal P-values are smaller
        than adj_sig_level divided by the total number of motifs tested in this
        run (e.g. for typical k-mer scanning, the threshold is set to
        adj_sig_level / 4^k for each k)
    
    -n
        Do nothing; useful with reporting options like -M when you just want to
        know the mapping results
    
    
    Conservation Options
    
    -C orthology_file
        JAX labs orthology table
    
    -o FASTA_ortholog_file
        FASTA file of orthologous upstream sequences
    
    
    Expression Data Options
    
    -O order_file
        Use the rank-ordered sequence IDs or Affymetrix probe set IDs in
        order_file (one per line) representing differential expression data
    
    -E expression_file
        Use the sequence IDs or Affymetrix probe set IDs with numerical
        measures of differential expression in expression_file (tab-delimited)
        for differential expression data
    
    -A annotation_file
        Use the Affymetrix CSV annotation file (or a custom CSV or TSV file
        with the columns "Probe Set ID" and "RefSeq Transcript ID") to map
        Affymetrix probe set IDs to RefSeq Transcript IDs in the FASTA file
    
    -M
        Print information about the number of sequences in the FASTA file, the
        number of entries in the expression file, and the number of sequences
        that were ultimately associated with exactly one expression entry
    
    
    Motif Discovery Options
    
    -k kmer_sizes
        Scan all k-mer motifs of the given sizes; kmer_sizes is a comma-
        delimited list of integers e.g. 6,7,8,9
    
    -g gap_formats
        Scan all "gap-k-mer" motifs in each of the given formats; gap_formats
        is a comma-delimited list of gap formats e.g. xxx-xxx,xxx--xxx,xx-xx-xx
    
    -m motif_file
        Scan IUPAC alphabet motifs specified in motif_file
    
    -w pssm_file
        Scan PSSM motifs from pssm_file given in the format described below
    
    -S incidence_file
        Scan pre-computed target sets from incidence_file in the "incidence
        list" format described below
    
    -b
        "Bidirectional" scanning -- allow instances of the motifs to occur on
        either strand of DNA; cannot be used with -S
    
    -r neighborhood_radius
        Approximate motif search; match motifs with up to neighborhood_radius
        mismatches from the motifs being scanned; cannot be used with -w or -S
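
    The effect of -b and -r on plain k-mer matching can be illustrated with the
    Python sketch below. It is a naive scan written for clarity, assuming a
    motif over [ACGT]; the function names are hypothetical and the code is not
    derived from the motifADE implementation.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence over ACGTN."""
    return seq.translate(COMPLEMENT)[::-1]

def find_with_mismatches(motif, seq, radius=0, bidirectional=False):
    """Start positions where motif matches seq with at most radius mismatches.

    With bidirectional=True, the reverse complement of the motif is scanned
    as well, so instances on either strand are reported.
    """
    patterns = [motif]
    if bidirectional:
        patterns.append(reverse_complement(motif))
    hits = set()
    for pattern in patterns:
        for start in range(len(seq) - len(pattern) + 1):
            window = seq[start:start + len(pattern)]
            mismatches = sum(a != b for a, b in zip(pattern, window))
            if mismatches <= radius:
                hits.add(start)
    return sorted(hits)
```

    For example, AGGTCA and TGACCT are reverse complements of each other, so a
    bidirectional scan for either one also finds instances of the other.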
    
    
    Reporting Options
        
    -M
        Print information about the number of sequences in the FASTA file, the
        number of entries in the expression file, and the number of sequences
        that were ultimately associated with exactly one expression entry
    
    -P
        Print the sequences that were ultimately used, with annotation in their
        headers containing the expression data that was ultimately used
    
    -v
        Verbose mode; print extra information about some tasks as they are
        performed
    
    
    Incidence Reporting Options
    
    -i
        Report the incidence of the motifs in "incidence list" format
        (described below)
    
    -I
        Report the incidence of the motifs in "incidence matrix" format; i.e.
        a matrix with one row for each gene and one column for each motif, with
        a 1 in the cell (i,j) if motif j occurs in gene i, otherwise 0
    
    -L bp_up,bp_down
        Report information about each instance of each identified motif,
        including the ID of the sequence in which it was found and associated
        differential expression value (if applicable), and also the sequence
        of the instance and bp_up and bp_down of flanking sequence surrounding
        it. NOTE: both bp_up and bp_down are measured from the START OF THE
        INSTANCE, i.e. if you are searching for TGACCTTNA and you want 3 bp
        upstream and downstream, then you have to use -L 3,12 because the motif
        is 9 bases long
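
    The coordinate convention for -L can be checked with a short Python sketch.
    The helper names are hypothetical, and clipping at the ends of the sequence
    is an assumption about how boundary cases might be handled.

```python
def bp_down_for(motif_length, flank):
    """bp_down needed to get `flank` bases after a motif of motif_length."""
    return motif_length + flank

def flanked_instance(seq, start, bp_up, bp_down):
    """Slice from bp_up bases before to bp_down bases after the instance START.

    start is the 0-based position of the first base of the motif instance.
    The slice is clipped to the sequence boundaries (an assumption).
    """
    left = max(0, start - bp_up)
    right = min(len(seq), start + bp_down)
    return seq[left:right]
```

    With a 9-base motif and 3 bases of flank on each side, bp_down_for(9, 3)
    gives 12, matching the -L 3,12 example above.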
    
    
    FILE FORMATS
    
    FASTA sequence file
    
    Standard FASTA format for DNA sequence with a required header convention;
    for each sequence, ">" followed by a unique alphanumeric sequence ID string
    and possibly a space followed by any additional information which motifADE
    will ignore. Subsequently, DNA sequence over the alphabet [ACGTN], spanning
    any number of lines. Sequences on separate lines will be concatenated.
    
    example:
    
    >seq_001
    CATGACTGCATGCAGAGGTCATACTGTCGATGCATGATGACCTCTG
    >seq_002
    ACTCTTCCGGTACTGACCTGTACGATTACGTACGTCAGATCGACGTACTGAC
    >SEQ_003
    CATGACGTTACGTAGGTCAACGATCTGTTGACCTACG
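
    A minimal Python reader for this header convention might look like the
    sketch below. The function name is hypothetical, and upper-casing the
    sequence and rejecting duplicate IDs are assumptions for this example.

```python
def parse_fasta(lines):
    """Return {sequence_id: sequence}.

    The ID is the header text after '>' up to the first whitespace; the rest
    of the header is ignored. Sequence lines are concatenated.
    """
    parts_by_id = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without a sequence ID")
            current_id = fields[0]
            if current_id in parts_by_id:
                raise ValueError(f"duplicate sequence ID: {current_id}")
            parts_by_id[current_id] = []
        elif current_id is not None:
            parts_by_id[current_id].append(line.upper())
    return {seq_id: "".join(parts) for seq_id, parts in parts_by_id.items()}
```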
    
    
    IUPAC motif file
    
    Tab-delimited file of the format NAME [tab] PATTERN [newline], where NAME
    is a user-specified name for the motif, and PATTERN is a set of sequences
    over the IUPAC degeneracy alphabet separated by "|"s. For the IUPAC code,
    see http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html
    
    example:
    
    HS1	AGGTCA
    HS2	TGACCT
    NRF1	GCGCAYGCGC|GCGCRTGCGC
    NRF2	CTTCCG|CGGAAG
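
    One way to interpret such patterns is to translate them into regular
    expressions, as in the Python sketch below. This is an illustration only;
    the function names are hypothetical and motifADE itself may match patterns
    differently.

```python
import re

# Standard IUPAC nucleotide degeneracy codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def iupac_to_regex(pattern):
    """Compile an IUPAC pattern, with '|' separating alternatives."""
    alternatives = ["".join(IUPAC[ch] for ch in alt)
                    for alt in pattern.split("|")]
    return re.compile("|".join(alternatives))

def parse_motif_file(lines):
    """Parse NAME<TAB>PATTERN lines into {name: compiled regex}."""
    motifs = {}
    for line in lines:
        line = line.strip()
        if line:
            name, pattern = line.split("\t")
            motifs[name] = iupac_to_regex(pattern)
    return motifs
```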
    
    
    PSSM motif file
    
    File format for representing position-specific scoring matrix ("weight
    matrix") motifs. For each motif: ">" followed by a user-specified name for
    the motif, then a tab, then the threshold log-score for the motif, then a
    newline; then the string "A [tab] C [tab] G [tab] T" and a newline; then
    the counts or frequencies for each position of the matrix, one position
    per line. In the example below, each row also ends with an IUPAC consensus
    letter for that position.
    
    example:
    
    >V$E2F_02	-1.18562366565774
    A	C	G	T
    0	0	0	12	T
    0	0	0	12	T
    0	0	0	12	T
    0	6	6	0	S
    0	2	10	0	G
    0	12	0	0	C
    0	0	12	0	G
    0	11	1	0	C
    >V$NFKAPPAB50_01	-1.40375929381938
    A	C	G	T
    0	0	18	0	G
    0	0	18	0	G
    0	0	18	0	G
    2	0	16	0	G
    16	1	0	1	A
    0	0	3	15	T
    0	7	1	10	Y
    0	16	0	2	C
    0	18	0	0	C
    0	17	1	0	C
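
    The layout above can be read with a few lines of Python, sketched below.
    The function name and the returned dictionary structure are hypothetical.
    The sketch only parses the file; it does not attempt to reproduce how
    motifADE converts counts to log-scores, which is not described here.

```python
def parse_pssm_file(lines):
    """Parse PSSM records into a list of dicts.

    Each dict has 'name', 'threshold' (the log-score from the header line)
    and 'rows' (one [A, C, G, T] list of floats per matrix position).
    A trailing consensus letter on a row, if present, is ignored.
    """
    motifs = []
    current = None
    for line in lines:
        fields = line.strip().split("\t")
        if fields == [""]:
            continue
        if fields[0].startswith(">"):
            current = {"name": fields[0][1:],
                       "threshold": float(fields[1]),
                       "rows": []}
            motifs.append(current)
        elif fields[:4] == ["A", "C", "G", "T"]:
            continue  # column header line
        elif current is not None:
            current["rows"].append([float(x) for x in fields[:4]])
    return motifs
```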
    
    
    Incidence list file
    
    First a header line of the format "Motif [tab] Incidence", then for each
    motif, NAME [tab] INCIDENCE, where NAME is the motif's user-specified name,
    and INCIDENCE is a comma-delimited list of sequence IDs for sequences that
    contain the motif, or expression IDs if -A was used.
    
    example:
    
    Motif	Incidence
    HS1	seq_001,seq_003
    HS2	seq_001,seq_002,seq_003
    NRF2	seq_002
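
    The "incidence list" format carries the same information as the "incidence
    matrix" produced by -I. The Python sketch below converts one to the other;
    the function names are hypothetical and the gene order must be supplied by
    the caller.

```python
def parse_incidence_list(lines):
    """Parse an incidence list into {motif_name: [sequence IDs]}."""
    rows = iter(lines)
    header = next(rows).strip().split("\t")
    if header != ["Motif", "Incidence"]:
        raise ValueError("expected header 'Motif<TAB>Incidence'")
    incidence = {}
    for line in rows:
        line = line.strip()
        if line:
            name, ids = line.split("\t")
            incidence[name] = ids.split(",")
    return incidence

def to_incidence_matrix(incidence, gene_ids):
    """Return (motif names, matrix) with one row per gene, one column per motif.

    Cell (i, j) is 1 if motif j occurs in gene i, otherwise 0.
    """
    motif_names = list(incidence)
    members = {name: set(ids) for name, ids in incidence.items()}
    matrix = [[1 if gene in members[name] else 0 for name in motif_names]
              for gene in gene_ids]
    return motif_names, matrix
```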

        
SAMPLE OUTPUT
    
    ./motifADE -k 7 -a .05 \
        -A annotation_files/MG_U74Av2_annot.maximum_matching.tsv \
        -E example_expression_data/PGC_day3_snr \
        -C orthology_files/JAX_mouse_to_human.tsv \
        -o sequence_databases/tx/ucsc.refGene_hg18_tx_1000_1000 \
        sequence_databases/tx/ucsc.refGene_mm8_tx_1000_1000
    
    Motif   Frequency Delta-Median Z-score P-value     Adjusted P-value
    AAGGTCA 0.0498851 0.568521     5.7695  7.9508e-09  0.000130266
    CTTCCGG 0.0809195 0.326803     4.73282 2.21426e-06 0.0362784
    GACCTTG 0.0445977 0.528876     5.17365 2.29565e-07 0.00376119
    CGGAAGT 0.0696552 0.377269     4.73151 2.2286e-06  0.0365133
    TGACCTT 0.0471264 0.622672     6.87246 6.3104e-12  1.0339e-07
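
    The P-value and Adjusted P-value columns above appear consistent with a
    two-sided normal tail probability of the Z-score, multiplied by the
    4^7 = 16384 7-mers tested. The Python sketch below reproduces that
    arithmetic; it is inferred from the numbers shown rather than from the
    source code, and the function names are hypothetical.

```python
import math

def two_sided_p(z):
    """Two-sided tail probability of a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted P-value, capped at 1."""
    return min(1.0, p_value * n_tests)

# First row of the sample output: AAGGTCA, Z-score 5.7695.
p = two_sided_p(5.7695)       # close to the reported 7.9508e-09
adj = bonferroni(p, 4 ** 7)   # close to the reported 0.000130266
```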