[page edited on October 27, 2016] - version 2.2.10
DiscoSnp major update: DiscoSnp becomes DiscoSnp++
Major modifications have been made to DiscoSnp over the last few weeks. The improvements are the following:
- Thanks to the recoding using the GATB library:
- Even faster execution, and parallelization of the kissnp2 module.
- An improved progress message
- A single file for storing the graph. This file (.h5) may be used in any GATB tool.
- DiscoSnp is no longer limited to isolated SNP detection:
- Up to P (parameter) close SNPs may be found within a single bubble
- Insertions and deletions of length less than or equal to D (parameter) are also detected.
Given these modifications, and since DiscoSnp is no longer limited to SNP prediction, we also decided to change its name.
Thus DiscoSnp becomes DiscoSnp++.
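As a toy illustration of the roles of the P and D parameters (this is not how kissnp2 explores the graph), the following Python sketch classifies a pair of already-extracted bubble paths. The function name, the representation of paths as plain strings and the return values are assumptions made for this example only.

```python
def classify_bubble(upper, lower, max_snps, max_indel_len):
    """Toy classifier for two bubble paths given as plain strings.

    max_snps plays the role of P, max_indel_len the role of D.
    Returns ("snp", n), ("indel", length) or None.
    """
    if len(upper) == len(lower):
        # Same length: count mismatching positions as candidate SNPs.
        n = sum(a != b for a, b in zip(upper, lower))
        return ("snp", n) if 1 <= n <= max_snps else None

    short, long_ = sorted((upper, lower), key=len)
    # Length of the common prefix.
    p = 0
    while p < len(short) and short[p] == long_[p]:
        p += 1
    # Length of the common suffix, not overlapping the prefix.
    s = 0
    while s < len(short) - p and short[-1 - s] == long_[-1 - s]:
        s += 1
    # Pure indel: the short path is fully explained by prefix + suffix.
    indel_len = len(long_) - len(short)
    if p + s == len(short) and indel_len <= max_indel_len:
        return ("indel", indel_len)
    return None
```

With max_snps set to 0 the sketch reports no SNP at all, and an indel longer than max_indel_len is rejected.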
Please post feedback and comments on the Biostar forum.
A tutorial is available in these slides: demo_discosnp
The discoSnp++ software is designed for discovering Single Nucleotide Polymorphisms (SNPs) and insertions/deletions (indels) from raw set(s) of reads obtained with Next Generation Sequencers (NGS).
Note that the number of input read sets is not constrained: it can be one, two, or more. Note also that no other data, such as a reference genome or annotations, are needed.
The software is composed of two modules. The first module, kissnp2, detects SNPs from read sets. The second module, kissreads2, enhances the kissnp2 results by computing, per read set and for each variant found, (i) its mean read coverage and (ii) the (Phred) quality of the reads generating the polymorphism.
A VCF file is also created, with or without the use of a reference genome.
Input (what about read pairs?)
discoSnp takes raw NGS datasets as input (FASTA or FASTQ, gzipped or not). No reference genome is required. Read pairs can be given; however, the pairing information is not used in this framework. The detected SNPs are output in the contig they belong to, and the contig length does not depend on pairing information. Note that two read files corresponding to paired reads should belong to the same file of files (see the documentation).
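The grouping of paired files can be sketched in Python as follows. This is a minimal sketch and rests on an assumed layout: a main file of files with one line per read set, where a read set made of several files is listed through its own sub file of files. The function name and the output file names are hypothetical; the documentation remains the reference for the exact format.

```python
from pathlib import Path


def write_file_of_files(read_sets, out_dir):
    """Write a main file of files, plus one sub-file per multi-file read set.

    read_sets: list of lists of read file paths; each inner list is one
    read set (for instance the two files of a paired-end run).
    Returns the path of the main file of files.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    main_lines = []
    for i, files in enumerate(read_sets, start=1):
        if len(files) == 1:
            # A single-file read set is listed directly.
            main_lines.append(str(files[0]))
        else:
            # Several files (e.g. a pair) are grouped in one sub file of files.
            sub = out_dir / f"fof_set{i}.txt"
            sub.write_text("\n".join(str(f) for f in files) + "\n")
            main_lines.append(str(sub))
    main = out_dir / "fof.txt"
    main.write_text("\n".join(main_lines) + "\n")
    return main
```

For example, `write_file_of_files([["a_1.fq.gz", "a_2.fq.gz"], ["b.fq"]], "fofs")` describes two read sets, the first one being a pair kept together in the same file of files.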
Here are a few slides about discoSnp: colloque_GE_2013_discoSnp
Paper & Citation
Uricaru, Raluca; Rizk, Guillaume; Lacroix, Vincent; Quillery, Elsa; Plantard, Olivier; Chikhi, Rayan; Lemaitre, Claire; Peterlongo, Pierre. (2014). Reference-free detection of isolated SNPs. Nucleic Acids Research. doi:10.1093/nar/gku1187
C. Riou, C. Lemaitre, and P. Peterlongo, “VCF_creator: Mapping and VCF Creation features in DiscoSnp++”. Poster at Jobim 2015
For remarks and questions, please use the Biostar forum.
Please read and accept the GNU AFFERO GENERAL PUBLIC LICENSE before use and distribution.
Last stable version 2.2.10: download link
- Mac & Linux binaries
- 25/10/2016 2.2.10
- Python scripts compatible with Python 3
- 28/06/2016 2.2.9
- Fixed a VCF creator bug
- Optimised the VCF creator (approx. 3 times faster)
- 04/05/2016 2.2.8
- Fixed a tiny kissreads bug
- Added continuous integration
- 15/04/2016 2.2.7 (now uses the GitHub repository: http://github.com/GATB/DiscoSnp)
- Added the possibility to limit the number of symmetrically branching crossroads traversed during bubble finding
- b2 mode: explore all possible symmetrical paths, even in case of success on one of the paths
- b2 mode: avoid redundancies
- Increased the maximal number of detectable close SNPs thanks to a non-recursive part in the bubble enumeration
- Fixed VCF creation bugs
- Fixed prediction bugs
- Improved VCF creator error messages in case of missing values in BWA results
- Increased the max breadth for indel detection
- 24/11/2015 DiscoSNP++-2.2.4
- Fixes a tiny bug in the run_discoSnp++.sh script. Thanks Hanan (https://www.biostars.org/p/155781/#167002)
- 18/11/2015 DiscoSNP++-2.2.3
- Dump of the read file names. C_1, C_2, … are provided in the .fa and the VCF file. A file now indicates the correspondence between each C_i and a set of read files. See the documentation for details.
- Removes indels if the repeat size is higher than a user-defined threshold (max_ambigous_indel). Indels with a long repeat size (e.g. >20 in our tests) are very often false positives.
- BUG correction:
- VCF bug described here: https://www.biostars.org/p/166298/ is now fixed.
- kissread bug when P > 1 (segfault corrected).
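The indel filter introduced in this release can be illustrated with a short Python sketch. The dictionary representation of a variant and its keys are hypothetical, and the default threshold of 20 is only borrowed from the "e.g. >20" remark above; it is not claimed to be the tool's default.

```python
def filter_ambiguous_indels(variants, max_ambiguous_indel=20):
    """Drop indels whose repeat size exceeds the threshold; keep everything else.

    variants: list of dicts with a "type" key ("snp" or "indel") and, for
    indels, a "repeat_size" key (hypothetical representation).
    """
    return [
        v for v in variants
        if not (v["type"] == "indel" and v["repeat_size"] > max_ambiguous_indel)
    ]
```

SNPs are never removed by this filter, and an indel whose repeat size equals the threshold is kept, since only values strictly higher than the threshold are discarded.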
- 05/10/2015 DiscoSnp++-2.2.1
- BUG correction:
- Kissreads module time (bug fixed by Guillaume Rizk) and memory.
- VCF creator bug correction
- Redundant bubble detection suppression
- The VCF creator uses BWA MEM by default
- 17/07/2015: Important update – DiscoSNP++-2.2.0
- The input read set format has changed: a file of files is now used. This provides an easier way of dealing with read sets composed of several read files (paired-end or pools). See the documentation in the doc directory
- The kmer coverage threshold can be
- set separately for each read set
- and/or automatically detected
- If a reference genome is provided, it can be used for predicting variants. For instance, a single read set may be compared to the reference genome (option -R).
- With respect to the previous change (-R), the read coherency in kissreads has changed.
- Before: a variant was read coherent if both of its paths were read coherent
- After: a variant is read coherent if at least one of its two paths is read coherent (otherwise all homozygous calls obtained when comparing a read set to a reference would be incoherent)
- Kissreads parallelization has been improved. OMP is no longer used, and running times are decreased.
- Two memory bugs have been fixed. They occurred mainly when using a large number of read sets.
- It is now possible to detect only indels (i.e. -P 0 detects no SNP)
- Memory issue detected:
- All tools (kissnp2, kissreads2, VCF_creator) have a limited memory footprint. However, due to kissnp parallelization, in some cases, memory may increase linearly with the number of used cores. If the memory is too high, limit the number of cores with the -u option.
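The read-coherency change of this release can be summarised by a small Python sketch. The function name and the boolean inputs (whether each path of the variant is itself read coherent) are assumptions made for illustration; how kissreads decides the coherency of a single path is not covered here.

```python
def is_variant_read_coherent(upper_coherent, lower_coherent, legacy=False):
    """Return True if a variant is considered read coherent.

    legacy=True  reproduces the rule used before this release:
                 both paths must be read coherent.
    legacy=False reproduces the new rule:
                 at least one of the two paths must be read coherent.
    """
    if legacy:
        return upper_coherent and lower_coherent
    return upper_coherent or lower_coherent
```

A homozygous call obtained when comparing a read set to a reference typically has only one path supported by reads: `is_variant_read_coherent(False, True)` is True under the new rule, whereas `is_variant_read_coherent(False, True, legacy=True)` is False.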
- 13/05/2015: DiscoSNP++-2.1.7
- Fixes a bug with very long reads
- 04/05/2015: DiscoSNP++-2.1.6 + (mac and linux) binaries: http://gatb.inria.fr/binaries-url/
- Adding the vcf doc
- Adding the -u option (limiting the maximal number of used threads)
- Fixing some compilation bugs
- 03/05/2015: DiscoSNP++-2.1.5 +
- Fixes compilation bugs with some compilers
- Fixes some VCF generation bugs
- 02/04/2015: DiscoSNP++-2.1.4
- Fixes a redundancy bug with the b 2 option
- Generates an IGV compatible VCF
- Fixes various small VCF bugs
- 23/03/2015: DiscoSNP++-2.1.3
- Automatically creates a VCF
- Genotyping of results (for diploids)
- Warning: documentation not up to date.
- 02/03/2015: DiscoSNPpp-2.0.6.
- Genotypes are automatically computed
- 24/02/2015: DiscoSNPpp-2.0.5, fixes a compilation bug on some OS.
- More informative FASTA headers
- Update of the kissread tool for close SNPs precision
- Unique id for SNPs and indels
- Documentation update
- 26/01/2015: DiscoSNPpp-2.0.2
- Fixed a progress bar display bug on macOS.
- 22/01/2015: DiscoSNPpp-2.0.1
- Fixed a bug in the command line parser.
- 19/01/2015: DiscoSNPpp-2.0.0
- Corrected the discoSnp++2csv.py file
- Faster, nicer presentation, unique .h5 graph file
- Detects also indels and close SNPs
- 29/10/2014: [BETA] discoSnp_1.2.6: Kissreads changes: added a minimal spanning option; changed the strategy when a read has multiple hits on a fragment.
- 01/10/2014: discoSnp_1.2.5: fixed bugs of version 1.2.4 (bugs were not affecting version <1.2.4).
- 25/09/2014: discoSnp_1.2.4: increased kissreads speed (approx. x2). Buggy version, do not use.
- 28/07/2014: discoSnp_1.2.3: improved the documentation; cleaned up the output messages and removed useless output files
- 20/05/2014: discoSnp_1.2.2: -b 0 is the default option.
- 26/04/2014: discoSnp_1.2.1: Fixes a compilation problem due to a makefile typo.
- 05/03/2014: discoSnp_1.2.0: New option with regards to branching bubbles
- 14/11/2013: discoSnp_1.0.1: fixes a bug concerning the contigs extensions (note that unitig extension lengths were not affected)
- 14/10/2013: discoSnp_1.0.0: read2SNPs name changes. discoSnp_1.0.0 is a new name for read2SNPs_126.96.36.199
- discoSnp can be used on the GenOuest galaxy server
- A GenOuest account is needed
- discoSnp can be integrated in your Galaxy instance using the GenOuest Toolshed (section Symbiose or Next generation Sequencing)
- directly via the toolshed (without authentication) by downloading the latest version of the source code (in zip, tar.gz or tar.bz2 format)
- by adding the GenOuest toolshed in the “tool_sheds_conf.xml” file in your Galaxy configuration and by installing discoSnp within the Admin panel (Search and browse tool sheds)
Debian and Ubuntu packages:
NAR Paper datasets
We believe that the datasets that were used for testing discoSnp may be useful for testing similar tools. All simulated datasets presented in the NAR paper are available from this web site.
Coli datasets
- Simulated genomes: http://www.irisa.fr/symbiose/people/ppeterlongo/discoSnp_data/coli/simulated_genomes/coli_genomes.zip
- Simulated reads: http://www.irisa.fr/symbiose/people/ppeterlongo/discoSnp_data/coli/simulated_reads/coli_reads.zip
- Reference SNP sets (formatted as the discoSnp output): http://www.irisa.fr/symbiose/people/ppeterlongo/discoSnp_data/coli/reference_snp_bubbles/reference_snp_coli.zip
- DiscoSnp predictions: http://www.irisa.fr/symbiose/people/ppeterlongo/discoSnp_data/coli/discoSnp_results/coli_res_discoSnp.zip
- VCF files used for generating SNPs:
- Reference SNP set (formatted as the discoSnp output)
- Simulated read sets (2x4GB)
- DiscoSnp predictions