Post-Processing Software

The program post-processes the de Bruijn graph produced by the Velvet assembler to construct splicing graphs, which preserve alternative splicing information, for RNA-Seq libraries. For each node in each splicing graph, the expression level is reported with respect to each library as the number of reads per kilobase of node per million reads (RPKM).
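As a rough illustration of the RPKM unit only, the sketch below applies the standard definition to a single node. The function name, argument names, and numbers are hypothetical, and how postprocess itself assigns reads to nodes is not described here.

```python
def rpkm(reads_on_node, node_length_bp, total_reads_in_library):
    """Reads per kilobase of node per million reads in the library."""
    if node_length_bp <= 0 or total_reads_in_library <= 0:
        raise ValueError("node length and library size must be positive")
    kilobases = node_length_bp / 1_000
    millions_of_reads = total_reads_in_library / 1_000_000
    return reads_on_node / kilobases / millions_of_reads


# Hypothetical numbers: 50 reads on a 500 bp node, library of 2,000,000 reads.
print(rpkm(50, 500, 2_000_000))  # 50.0
```

Because the value is normalized by both node length and library size, values for the same node can be compared across libraries of different depth.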

The source code consists of a single file, postprocess.c. It can be compiled with the command "gcc -O3 -o postprocess postprocess.c".

Steps

For each RNA-Seq library, trim each read based on quality score. For reads from a Solexa library, one possible strategy is to remove the first position whose quality score is below a cutoff such as 15, together with all positions to its right. For reads from a 454 library, one possible strategy is to discard any read in which more than half of the positions have a quality score below a cutoff such as 15. Primer and internal adapter sequences should also be removed.
Run the Velvet assembler with the following commands:
- velveth output_directory hash_length solexa_1 ... solexa_m -long 454_1 ... 454_n
- velvetg output_directory -cov_cutoff cutoff -max_branch_length 0 -max_divergence 0 -max_gap_count 0 -read_trkg yes
where solexa_1 ... solexa_m are m Solexa files (one for each library), and 454_1 ... 454_n are n 454 files (one for each library). Use -fasta or -fastq to specify the appropriate file formats.
Run the post-processing program with the following command:
- postprocess solexa_1 ... solexa_m 454_1 ... 454_n < output_directory/LastGraph > output_file
The read files should be the same as in Velvet and in exactly the same order. Omit flags such as -long, -fasta, and -fastq in the postprocess command. Only input files in fasta or fastq format are supported.
Perform downstream analysis using output_file.
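
The two quality-trimming strategies suggested in the first step can be sketched as follows. This is a minimal illustration, not part of postprocess: the function names are hypothetical, and the quality scores are assumed to be already decoded into a list of integers (the character encoding of quality strings differs between pipelines).

```python
def trim_solexa_read(sequence, qualities, cutoff=15):
    """Cut the read at the first position whose quality is below the cutoff.

    That position and everything to its right are removed.
    """
    for i, q in enumerate(qualities):
        if q < cutoff:
            return sequence[:i], qualities[:i]
    return sequence, qualities


def keep_454_read(qualities, cutoff=15):
    """Return False if more than half of the positions are below the cutoff."""
    low = sum(1 for q in qualities if q < cutoff)
    return 2 * low <= len(qualities)


# Hypothetical read: the fourth position (quality 10) triggers the cut.
print(trim_solexa_read("ACGTAC", [30, 30, 20, 10, 30, 30]))
# Hypothetical 454 read: three of four positions are below the cutoff.
print(keep_454_read([10, 10, 10, 30]))
```

Reads that become very short after trimming may be worth dropping before assembly; a suitable minimum length depends on the hash length passed to velveth.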

Output

The splicing graphs are represented in an annotated fasta format, in which each potentially non-linear structure is given as a collection of nodes, with connecting edge information embedded within the node names. Different splicing graphs are separated by blank lines.

Each node name is given as >NODE_u:v_1,v_2,...,v_p, where u is the ID of the current node, and u -> v_1, u -> v_2, ..., u -> v_p are the edges leaving it in the splicing graph. The name is followed by one RPKM value for each library, listed in the same order as the read files.
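
For downstream analysis, a reader for this format might look like the sketch below. It rests on assumptions that are not stated above: the RPKM values are taken to be separated from the node name and from each other by whitespace, a node without outgoing edges is taken to have an empty list after the colon, and node IDs are kept as strings. The function names and the sample text are made up for illustration.

```python
def parse_header(line):
    """Split an assumed header '>NODE_u:v_1,...,v_p rpkm_1 ... rpkm_k'."""
    fields = line[1:].split()
    name, rpkm_fields = fields[0], fields[1:]
    node_part, _, successor_part = name.partition(":")
    node_id = node_part[len("NODE_"):] if node_part.startswith("NODE_") else node_part
    successors = [s for s in successor_part.split(",") if s]
    return node_id, successors, [float(x) for x in rpkm_fields]


def parse_splicing_graphs(lines):
    """Return a list of graphs; each graph maps node ID -> node record."""
    graphs, current, node = [], {}, None
    for raw in lines:
        line = raw.strip()
        if not line:                      # a blank line ends the current graph
            if current:
                graphs.append(current)
            current, node = {}, None
        elif line.startswith(">"):        # header line starts a new node
            node_id, successors, rpkm = parse_header(line)
            node = {"successors": successors, "rpkm": rpkm, "sequence": ""}
            current[node_id] = node
        elif node is not None:            # sequence line (may span several lines)
            node["sequence"] += line
    if current:
        graphs.append(current)
    return graphs


# Made-up sample with two graphs and two libraries.
sample = """>NODE_1:2,3 12.5 0.0
ACGT
>NODE_2: 3.0 1.5
GG

>NODE_7: 0.5 0.5
TTRA
"""
graphs = parse_splicing_graphs(sample.splitlines())
print(len(graphs), graphs[0]["1"]["successors"], graphs[0]["1"]["rpkm"])
```

With a real output file, the same function can be called on an open file handle, for example `parse_splicing_graphs(open("output_file"))`.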

SNPs are reported within the sequences as IUPAC letters other than A, C, G, and T.
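
To list the SNP positions in a node sequence, the ambiguity letters can be expanded with the standard IUPAC nucleotide code table, as in this sketch. The function name and the example sequence are hypothetical, positions are 0-based, and whether every code in the table (N, for instance) can actually appear in the output is an assumption.

```python
# Standard IUPAC nucleotide ambiguity codes and the bases they stand for.
IUPAC_AMBIGUITY = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_snps(sequence):
    """Return (position, code, possible_bases) for each non-ACGT letter."""
    snps = []
    for position, letter in enumerate(sequence.upper()):
        if letter in IUPAC_AMBIGUITY:
            snps.append((position, letter, IUPAC_AMBIGUITY[letter]))
    return snps


# Hypothetical sequence with two SNPs: A/G at position 2 and C/T at position 4.
print(find_snps("ACRTY"))
```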

Lucilia sericata Transcriptome Assemblies

postprocess.zip

Reference

Sze S.-H. and Tarone A.M. (2014) A memory-efficient algorithm to obtain splicing graphs and de novo expression estimates from de Bruijn graphs of RNA-Seq data. BMC Genomics, 15(Suppl 5), S6.

Sze S.-H., Dunham J.P., Carey B., Chang P.L., Li F., Edman R.M., Fjeldsted C., Scott M.J., Nuzhdin S.V. and Tarone A.M. (2012) A de novo transcriptome assembly of Lucilia sericata (Diptera: Calliphoridae) with predicted alternative splices, single nucleotide polymorphisms, and transcript expression estimates. Insect Molecular Biology, 21, 205-221.