MotifEnumerator Software

MotifEnumerator is an improved pattern-driven algorithm for motif finding in DNA sequences proposed in Sze and Zhao (2006).

Depending on whether mismatches and don't cares are allowed, MotifEnumerator has different time and space complexities when l is large enough:

Mismatches allowed: O(4^l · l) space, l ≤ 13 on 32-bit systems
- Don't cares not allowed: O(4^l · lk) time
- Don't cares allowed: O(5^l · lk) time
Mismatches not allowed: O(lkn) space
- Don't cares not allowed: O(lkn) time
- Don't cares allowed: O(2^l · lkn) time

where k is the number of sequences, n is the length of each sequence and l is the motif length.
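
To get a feel for how fast the pattern space in these bounds grows, the sketch below counts the candidate patterns of length l: 4^l over A/C/G/T, or 5^l when each position may also be a don't care. It also shows one common way to map an l-mer to a table index using 2 bits per base. The function names and the 2-bit encoding are assumptions made for illustration; they are not taken from motifenumerator.c.

```c
#include <stdint.h>

/* Number of length-l patterns over an alphabet of the given size
 * (4 for A/C/G/T, 5 if a don't care symbol is added).
 * Returns 0 if the count would not fit in 64 bits. */
uint64_t pattern_count(unsigned alphabet, unsigned l)
{
    uint64_t count = 1;
    for (unsigned i = 0; i < l; i++) {
        if (alphabet != 0 && count > UINT64_MAX / alphabet)
            return 0; /* overflow */
        count *= alphabet;
    }
    return count;
}

/* Map an l-mer to an index in [0, 4^l) using 2 bits per base
 * (hypothetical encoding: A=0, C=1, G=2, T=3).
 * Returns -1 on any other character or if l > 31. */
int64_t encode_lmer(const char *s, unsigned l)
{
    if (l > 31)
        return -1;
    int64_t index = 0;
    for (unsigned i = 0; i < l; i++) {
        int code;
        switch (s[i]) {
        case 'A': case 'a': code = 0; break;
        case 'C': case 'c': code = 1; break;
        case 'G': case 'g': code = 2; break;
        case 'T': case 't': code = 3; break;
        default: return -1;
        }
        index = index * 4 + code;
    }
    return index;
}
```

For l=13, pattern_count(4, 13) is 67,108,864 and pattern_count(5, 13) is 1,220,703,125, so each extra motif position multiplies the table size by 4 or 5.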

Using MotifEnumerator

The MotifEnumerator source code consists of a single file, motifenumerator.c. It can be compiled under Unix, Linux, or Windows (Cygwin) with the command "gcc -O3 -o motifenumerator motifenumerator.c".

MotifEnumerator needs four command-line parameters in the following order:

max_len_tuple (l): maximum motif length including don't care positions
max_len_word (l'): maximum number of positions within a motif that are not don't cares (set l'=0 to disallow don't cares)
max_dist (d): maximum number of mismatches allowed
num_strand: 1 means forward strand only, 2 means both strands
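
The roles of d and num_strand can be illustrated with a small sketch. The first function counts mismatches between a pattern and a candidate occurrence while skipping don't care positions (written as '-' here, matching the notation this document uses for motif patterns); an occurrence would be accepted when the count is at most d. The second builds the reverse complement that a search over both strands (num_strand=2) would also need to examine. Both function names are hypothetical, and the code is an illustration rather than an excerpt from motifenumerator.c.

```c
#include <stddef.h>

/* Count positions where pattern and site differ, ignoring positions
 * where the pattern holds the don't care symbol '-'.
 * Both strings must have at least len characters. */
int count_mismatches(const char *pattern, const char *site, size_t len)
{
    int mismatches = 0;
    for (size_t i = 0; i < len; i++) {
        if (pattern[i] == '-')
            continue; /* don't care: any base matches */
        if (pattern[i] != site[i])
            mismatches++;
    }
    return mismatches;
}

/* Write the reverse complement of seq[0..len) into out, which must have
 * room for len + 1 characters. Characters other than A/C/G/T become 'N'. */
void reverse_complement(const char *seq, size_t len, char *out)
{
    for (size_t i = 0; i < len; i++) {
        char c;
        switch (seq[len - 1 - i]) {
        case 'A': c = 'T'; break;
        case 'C': c = 'G'; break;
        case 'G': c = 'C'; break;
        case 'T': c = 'A'; break;
        default:  c = 'N'; break;
        }
        out[i] = c;
    }
    out[len] = '\0';
}
```

With the pattern "AC-T" and the site "AGGT", count_mismatches reports one mismatch (position 2), because the third position is a don't care.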

The input sample (from stdin) consists of a set of sequences in FASTA format, while the output (to stdout) shows a set of non-overlapping motifs with e-value below 1.0.

The occurrences of each motif are displayed in the format "string seq/pos", where "string" is the motif occurrence, "seq" is the sequence name, and "pos" is the position within the sequence ('-' means reverse strand, while '+' denotes occurrences added after refinement).

For each motif, an additional line following the occurrences displays the motif pattern before refinement (with '-' denoting don't cares), the value of d, and the e-value.
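
For post-processing the output, the occurrence lines described above can be split into their three fields. The sketch below assumes that "string" and "seq/pos" are separated by spaces or tabs and that "pos" is everything after the last '/'; it keeps "pos" as text because the exact placement of the '-' and '+' markers is not specified here. The function name parse_occurrence is hypothetical and not part of MotifEnumerator.

```c
#include <string.h>

/* Split one occurrence line of the assumed form "string seq/pos" into
 * its three fields. Each output buffer must hold at least cap bytes.
 * Returns 0 on success, -1 if the line does not have that shape or a
 * field does not fit in cap bytes. */
int parse_occurrence(const char *line, char *site, char *seq, char *pos,
                     size_t cap)
{
    /* First token: the motif occurrence. */
    size_t site_len = strcspn(line, " \t");
    const char *rest = line + site_len;
    rest += strspn(rest, " \t");

    /* Split the remainder at the last '/', since names may contain '/'. */
    const char *slash = strrchr(rest, '/');
    if (slash == NULL)
        return -1;
    size_t seq_len = (size_t)(slash - rest);
    size_t pos_len = strcspn(slash + 1, " \t\r\n");

    if (site_len == 0 || seq_len == 0 || pos_len == 0)
        return -1;
    if (site_len >= cap || seq_len >= cap || pos_len >= cap)
        return -1;

    memcpy(site, line, site_len);
    site[site_len] = '\0';
    memcpy(seq, rest, seq_len);
    seq[seq_len] = '\0';
    memcpy(pos, slash + 1, pos_len);
    pos[pos_len] = '\0';
    return 0;
}
```

The per-motif summary line (pattern, d, and e-value) has a different layout and would need separate handling.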

Examples (l=12, num_strand=1, and input sample.fasta):

To allow mismatches but disallow don't cares, use "motifenumerator 12 0 12 1 < sample.fasta"
To allow mismatches and don't cares, use "motifenumerator 12 12 12 1 < sample.fasta"
To disallow mismatches and don't cares, use "motifenumerator 12 0 0 1 < sample.fasta"
To disallow mismatches but allow don't cares, use "motifenumerator 12 12 0 1 < sample.fasta"

Reference

Sze S.-H. and Zhao X. (2006) Improved pattern-driven algorithms for motif finding in DNA sequences. Proceedings of the 2005 Joint RECOMB Satellite Workshops on Systems Biology and Regulatory Genomics, Lecture Notes in Bioinformatics, 4023, 198-211.