NGS technologies have confronted the life sciences with a deluge of raw data. Classical analysis pipelines for such data often begin with an assembly step, which requires large amounts of computing resources and may remove or modify part of the biological information contained in the data. Our approach instead focuses directly on biological questions by working on raw, unassembled NGS data through a suite of six command-line tools.
Dedicated to “whole genome assembly-free” treatments, the Colib’read project (a continuation of the alcovna project) uses optimized algorithms for various analyses of NGS datasets, such as variant calling or read set comparison. Because they rely on de Bruijn graphs and Bloom filters, these analyses can be performed in a few hours using small amounts of memory. Applications to real data demonstrate the good accuracy of these tools compared to classical approaches. To facilitate data analysis and tool dissemination, we developed Galaxy tools and tool shed repositories.
The common denominator of all the tools presented is that they analyze NGS datasets without requiring any reference genome.
Kissplice, DiscoSNP and TakeABreak perform de novo variant identification and quantification. For these tools, the general approach consists of 1) defining a model for the elements sought; 2) detecting, in one or several NGS datasets, the elements that fit the model; 3) outputting these elements together with a score and their genomic neighborhood. Mapsembler focuses on sequences of interest within a micro targeted assembly, LorDec uses short reads to correct third generation long reads, and finally Commet is dedicated to the comparison of numerous metagenomic read sets.
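The three-step approach described above can be illustrated with a toy example for isolated SNPs. The sketch below is not the implementation used by DiscoSNP or any other Colib’read tool; it is a minimal illustration under stated assumptions. The model assumed here is a "bubble" in the de Bruijn graph: two paths of k k-mers each that leave a common k-mer, differ at a single base, and share a common successor. The function names (`count_kmers`, `find_snp_bubbles`) are hypothetical, the score is simply the minimum k-mer abundance along each path, and the reported neighborhood is limited to the k-1 bases on each side of the variant. Reverse complements and sequencing errors are ignored.

```python
from collections import Counter
from itertools import combinations


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (no reverse complements)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def find_snp_bubbles(reads, k):
    """Return (seq_a, score_a, seq_b, score_b) for each toy SNP bubble.

    Step 1 (model): two paths of k k-mers differing only at one base.
    Step 2 (detection): scan branching k-mers and follow both branches.
    Step 3 (output): both (2k-1)-mers with a min-abundance score.
    """
    counts = count_kmers(reads, k)
    bubbles = []
    for start in sorted(counts):
        # Bases that extend the (k-1)-suffix of `start` into a known k-mer.
        alleles = [b for b in "ACGT" if start[1:] + b in counts]
        for x, y in combinations(alleles, 2):
            path_a = [start[1:] + x]
            path_b = [start[1:] + y]
            for _ in range(k - 1):
                # Both branches must be extended by the same single base.
                shared = [c for c in "ACGT"
                          if path_a[-1][1:] + c in counts
                          and path_b[-1][1:] + c in counts]
                if len(shared) != 1:
                    break
                path_a.append(path_a[-1][1:] + shared[0])
                path_b.append(path_b[-1][1:] + shared[0])
            else:
                # The bubble closes if both paths share a successor k-mer.
                closes = any(path_a[-1][1:] + d in counts for d in "ACGT")
                if closes:
                    seq_a = path_a[0] + "".join(p[-1] for p in path_a[1:])
                    seq_b = path_b[0] + "".join(p[-1] for p in path_b[1:])
                    score_a = min(counts[p] for p in path_a)
                    score_b = min(counts[p] for p in path_b)
                    bubbles.append((seq_a, score_a, seq_b, score_b))
    return bubbles


# Two alleles (C/T) of the same locus; the first one is sequenced twice.
reads = ["CAGTACGGATTC", "CAGTACGGATTC", "CAGTATGGATTC"]
for bubble in find_snp_bubbles(reads, k=4):
    print(bubble)
```

In this toy setting, the two scores give a rough quantification of each allele in the read set, which mirrors the "identification and quantification" goal stated above.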
Special care was taken to limit both the memory and time requirements of all tools. To that end, five of the six tools rely on a compact representation of a de Bruijn graph.
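To give an intuition of how a de Bruijn graph can be stored compactly, the sketch below keeps only the set of k-mers in a Bloom filter and treats edges as implicit: the successors of a k-mer are found by probing its four possible one-base extensions. This is a simplified illustration, not the data structure actually used by the Colib’read tools. The class and function names (`BloomFilter`, `build_graph`, `successors`), the filter size and the number of hash functions are all assumptions chosen for the example. A Bloom filter can return false positives, which a real implementation has to handle; this toy version does not, and it also ignores reverse complements.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: possible false positives, no false negatives."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive n_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def build_graph(reads, k):
    """Insert every k-mer of every read into a Bloom filter."""
    bloom = BloomFilter()
    for read in reads:
        for i in range(len(read) - k + 1):
            bloom.add(read[i:i + k])
    return bloom


def successors(bloom, kmer):
    """Edges are implicit: probe the four possible one-base extensions."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]


graph = build_graph(["ACGTAC", "CGTACG"], k=4)
print(successors(graph, "ACGT"))
```

The memory footprint depends only on the number of bits chosen for the filter, not on the length of the k-mers, which is what makes this kind of representation attractive for large read sets.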
The Colib’read team