Detailed Results

Figure S1: Top motif from PosMotif1 on samples of type real from the Tompa data set in Tompa's format (txt).

Table S1: Accuracy of the top motif from PosMotif1 on samples of type real from the Tompa data set (xls, txt).

Figure S2: Top motif from PosMotif2 on samples of type real from the Tompa data set in Tompa's format (txt).

Table S2: Accuracy of the top motif from PosMotif2 on samples of type real from the Tompa data set (xls, txt).

Figure S3: Top motif from PosMotif1 on samples that contain at least three genes in the SCPD database in Tompa's format (txt).

Table S3: Accuracy of the top motif from PosMotif1 on samples that contain at least three genes in the SCPD database (xls, txt).

Figure S4: Top motif from PosMotif2 on samples that contain at least three genes in the SCPD database in Tompa's format (txt).

Table S4: Accuracy of the top motif from PosMotif2 on samples that contain at least three genes in the SCPD database (xls, txt).

Figure S5: Top motif from PosMotif1 on samples from the ABS database in Tompa's format (txt).

Table S5: Accuracy of the top motif from PosMotif1 on samples from the ABS database (xls, txt).

Figure S6: Top motif from PosMotif2 on samples from the ABS database in Tompa's format (txt).

Table S6: Accuracy of the top motif from PosMotif2 on samples from the ABS database (xls, txt).


PosMotif Software

PosMotif is a motif finding algorithm for DNA sequences which uses a string representation that allows arbitrary ignored positions within the nonconserved portion of single motifs. It uses O(2^l) Markov chains to model the background distribution of motifs of length l by skipping these positions within each Markov chain.
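As a rough illustration of a string representation with ignored positions, the sketch below writes an ignored position as "." and matches such a pattern against a DNA string. The "." notation and the helper names matches, find_occurrences and ignore_masks are assumptions made for this example, not taken from the PosMotif source. The last helper enumerates every way of marking the positions of a length-l string as kept or ignored, which is where a count on the order of 2^l comes from.

```python
from itertools import product

IGNORED = "."  # assumed symbol for an ignored position (not from PosMotif)


def matches(pattern, window):
    """True if window agrees with pattern at every non-ignored position."""
    return len(pattern) == len(window) and all(
        p == IGNORED or p == w for p, w in zip(pattern, window)
    )


def find_occurrences(pattern, seq):
    """Start indices of all windows of seq that match pattern."""
    l = len(pattern)
    return [i for i in range(len(seq) - l + 1) if matches(pattern, seq[i:i + l])]


def ignore_masks(l):
    """All 2**l keep/ignore assignments for a string of length l."""
    return list(product((True, False), repeat=l))
```

For example, find_occurrences("A.G", "ACGATGAAG") returns [0, 3, 6], and ignore_masks(4) has 16 entries.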

The PosMotif software is available for download. It consists of four parts: PreProcess, PosMotif, PostProcess, and PostProcess2. The following steps will create a directory called posmotif.


Using PreProcess

The PreProcess source code consists of a single file preprocess.c that performs preprocessing of the background distribution. It can be compiled with the command "gcc -O3 -o preprocess preprocess.c".

PreProcess needs two command-line parameters in the following order: max_len_tuple and len_freq.

The input (from stdin) consists of a set of background sequences in FASTA format. The output (to stdout) consists of a list of occurrence numbers of all strings of length at most max_len_tuple with len_freq positions that are not ignored, which will serve as a background file for PosMotif. Note that len_freq corresponds to the Markov order plus 1.
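The counting described above can be sketched as follows. This is one possible reading of the description, not the logic of preprocess.c: it assumes that each counted string keeps its first and last positions (so its length is well defined), writes ignored positions as ".", and skips windows with a non-ACGT base at a kept position. The function names read_fasta, kept_sets and count_gapped_tuples are hypothetical, and the actual output format of PreProcess may differ.

```python
from collections import Counter
from itertools import combinations


def read_fasta(lines):
    """Parse FASTA text lines into a list of uppercase sequences."""
    seqs, cur = [], []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if cur:
                seqs.append("".join(cur))
                cur = []
        elif line:
            cur.append(line.upper())
    if cur:
        seqs.append("".join(cur))
    return seqs


def kept_sets(span, len_freq):
    """Position sets of size len_freq in a window of length span,
    always keeping both end positions (an assumption of this sketch)."""
    if len_freq == 1:
        return [(0,)] if span == 1 else []
    inner = combinations(range(1, span - 1), len_freq - 2)
    return [(0, *mid, span - 1) for mid in inner]


def count_gapped_tuples(seqs, max_len_tuple, len_freq):
    """Count strings of length at most max_len_tuple that have exactly
    len_freq non-ignored positions; ignored positions appear as '.'."""
    counts = Counter()
    for seq in seqs:
        for start in range(len(seq)):
            for span in range(len_freq, max_len_tuple + 1):
                if start + span > len(seq):
                    break  # longer spans would run past the sequence end
                window = seq[start:start + span]
                for kept in kept_sets(span, len_freq):
                    if all(window[p] in "ACGT" for p in kept):
                        chars = ["."] * span
                        for p in kept:
                            chars[p] = window[p]
                        counts["".join(chars)] += 1
    return counts
```

With max_len_tuple = 3 and len_freq = 2, the single sequence AAAA yields the string AA three times and A.A twice.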

Example: use "preprocess 20 3 < yst1000.fasta > yst1000.txt" to create the background file yst1000.txt from all upstream sequences of length 1000 in yeast.


Using PosMotif

The PosMotif source code consists of a single file posmotif.c that implements the main algorithm. It can be compiled with the command "gcc -O3 -o posmotif posmotif.c".

PosMotif needs three command-line parameters in the following order:

The input (from stdin) consists of the input sample in FASTA format. The output (to stdout) consists of all motifs with E-value less than 1. For each motif, the output lists its occurrences, then the motif itself, then the number of occurrences counted for each sequence in obtaining the E-value, and finally the E-value. An E-value marked with - indicates that the motif's occurrences overlap with the occurrences of some previous unmarked motif (this is not used further).

Example: use "posmotif yst1000.txt 18 3 < yst01r.fasta > yst01r.motif" to find motifs in the input sample yst01r.fasta.


Using PostProcess

The PostProcess source code consists of a single file postprocess.pl that performs the first postprocessing step. It can be run with the command "perl postprocess.pl".

The input (from stdin) consists of the output from PosMotif. The output (to stdout) consists of the motifs after the first postprocessing step. Each motif's occurrences are shown, followed by the motif itself, its E-value, and its rank.

Example: use "perl postprocess.pl < yst01r.motif > yst01r.post" to perform the first postprocessing step on the motifs in yst01r.motif.


Using PostProcess2

The PostProcess2 source code consists of a single file postprocess2.pl that performs the second postprocessing step. It can be run with the command "perl postprocess2.pl".

The input (from stdin) consists of the output from PostProcess. The output (to stdout) consists of the motifs after the second postprocessing step. Each motif's occurrences are shown, followed by the original motif from which neighboring motifs are obtained, its E-value, the number of neighboring motifs, and its hybrid rank.

Example: use "perl postprocess2.pl < yst01r.post > yst01r.post2" to perform the second postprocessing step on the motifs in yst01r.post.
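The four example commands above can be chained into a single run. The sketch below is a hypothetical wrapper, not part of the PosMotif distribution: the function names build_pipeline and run_pipeline are made up here, the compiled binaries are assumed to sit in the current directory as ./preprocess and ./posmotif, and the default parameter values are simply the ones used in the examples on this page.

```python
import subprocess


def build_pipeline(background_fasta, background_txt, sample_fasta, stem,
                   preprocess_args=("20", "3"), posmotif_args=("18", "3")):
    """Return (argv, stdin_path, stdout_path) for each of the four steps.

    Default arguments mirror the example commands; adjust as needed.
    """
    return [
        (["./preprocess", *preprocess_args], background_fasta, background_txt),
        (["./posmotif", background_txt, *posmotif_args],
         sample_fasta, stem + ".motif"),
        (["perl", "postprocess.pl"], stem + ".motif", stem + ".post"),
        (["perl", "postprocess2.pl"], stem + ".post", stem + ".post2"),
    ]


def run_pipeline(steps):
    """Run each step with its stdin and stdout redirected to files."""
    for argv, src, dst in steps:
        with open(src, "rb") as fin, open(dst, "wb") as fout:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(argv, stdin=fin, stdout=fout, check=True)
```

Calling run_pipeline(build_pipeline("yst1000.fasta", "yst1000.txt", "yst01r.fasta", "yst01r")) would then reproduce the example sequence of commands, leaving the final result in yst01r.post2.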


Reference

Zhao X. and Sze S.-H. (2011) Motif finding in DNA sequences based on skipping nonconserved positions in background Markov chains. Journal of Computational Biology, 18, 759-770.