INSaFLU relies on a multi-software bioinformatics pipeline that will be under continuous development and improvement not only to enrich it with new features, but also to continuously shape the protocol to the best methodological advances in the field.
The current software settings, which were chosen upon intensive testing, are described below. For additional details about the bioinformatics pipeline, please visit the INSaFLU github account: https://github.com/INSaFLU/INSaFLU (more information for each software can also be found in the official repositories; links are also provided below).
The bioinformatics pipeline developed and implemented in the INSaFLU web platform currently consists of 6 core steps (see WorkFlow) yielding multiple graphical, and sequence outputs (see Output visualization and download menu for details)
INSaFLU bioinformatics pipeline workflow
Modules (Description, Current software versions and settings)
Read quality analysis and improvement¶
This step takes the input single- or paired-end reads (fastq.gz format) and produces Trimmomatic-derived quality processed reads, as well as FastQC quality control reports for each file, before and after quality improvement. This module is automatically run upon reads upload (i.e., no user intervention is needed).
FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) (version 0.11.5; date 15.01.2018)
input: single- or paired-end reads (fastq.gz format) (e.g., sample_L001_R1_001.fastq.gz and sample_L001_R2_001.fastq.gz for Illumina technology reads)
—nogroup option: all reports will show data for every base in the read.
Trimmomatic (http://www.usadellab.org/cms/index.php?page=trimmomatic) (version 0.27; date 15.01.2018)
input: single- or paired-end reads (fastq.gz format) (e.g., sample_L001_R1_001.fastq.gz and sample_L001_R2_001.fastq.gz for Illumina paired-end reads)
SLIDINGWINDOW: perform a sliding window trimming, cutting once the average quality within the window falls below a threshold (SLIDINGWINDOW:5:20, where 5 refers to window and 20 to the minimum average quality)
LEADING: cut bases off the start of a read, if below a threshold quality (LEADING:3). This will allow discarding bases with very quality or N bases (quality score of 2 or less).
TRAILING: cut bases off the end of a read, if below a threshold quality (TRAILING:3). This will allow discarding bases with very quality or N bases (quality score of 2 or less).
MINLEN: drop the read if it is below a specified length (MINLEN:35)
TOPHRED33: Convert quality scores to Phred-33
Files between 50 - 300 MB are downsized to ~50 MB before analysis by randomly sampling reads using fastq-sample from fastq-tools package https://github.com/dcjones/fastq-tools (developed by Daniel C. Jones email@example.com)
Variant detection and consensus generation¶
This key module takes advantage of the multisoftware tool Snippy (please visit the official repository to get details about each component; https://github.com/tseemann/snippy) to perform reference-based mapping, followed by SNP/indel calling and annotation and generation of consensus sequences (quality processed reads obtained through Trimmomatic analysis are used as input). A reference sequence is selected for each project after uploading it or from the INSaFLU default reference database. Uploaded “.fasta” files are annotated using Prokka upon submission and automatically become available at the user-restricted reference database. Each project should ideally include viruses from the same type and sub-type/lineage (this typing data is automatically determined upon reads submission to INSaFLU).
Prokka (https://github.com/tseemann/prokka) (version 1.12; date 15.01.2018)
–kingdom: defines the Annotation mode (Viruses)
–R1 (and –R2): define the reads files used as input, i.e, quality processed reads (e.g., sample_1P.fastq.gz and sample_2P.fastq.gz) obtained after Trimmomatic analysis
—ref: define the reference sequence selected by the users (.fasta or gbk format)
–mapqual: minimum mapping quality to allow (–mapqual 20)
—mincov: minimum coverage of variant site (–mincov 10)
–minfrac: minumum proportion for variant evidence (–minfrac 0.51)
Integrative Genomics Viewer (http://software.broadinstitute.org/software/igv/) (version 2.3.98; date 15.01.2018)
inputs: reference file (.fasta); mapping file (.bam; .bai)
This module yields a deep analysis of the coverage for each per sample by providing the following data: mean depth of coverage per locus, % of locus size covered by at least 1-fold and % of locus size covered by at least 10-fold. The latter fits the minimum depth of coverage for variant calling applied by INSaFLU pipeline and constitutes the guide for consensus generation, i.e., consensus sequences (see Module “Variant detection and consensus generation”) are exclusively provided for locus fulfilling the criteria of having 100% of their size covered by at least 10-fold. Depth of coverage plots are additionally generated and can be interactively viewed at INSaFLU.
getCoverage.py (https://github.com/monsanto-pinheiro/getCoverage) (version v1.1; date 15.01.2018)
-i: define the input files, i.e, the coverage files (.depth.gz) generated through Snippy
-r: define the reference sequence selected by the users (.fasta format)
-o: defines the output file name (tab-separated value)
This module uses filtered nucleotide consensus sequences and performs refined nucleotide/protein sequence alignments and phylogenetic inferences. These outputs are automatically re-build and updated as more samples are added to user-restricted INSaFLU projects, making continuous data integration completely flexible and scalable.
Users can also easily color the phylogenetic tree nodes and/or display colored metadata blocks next to the tree according to any combination of metadata variables, which facilitates the integration of relevant epidemiological and/or clinical data towards an enhanced genome-based pathogen surveillance.
MAUVE (http://darlinglab.org/mauve/mauve.html) (version 2.4.0; date 15.01.2018)
progressiveMAUVE module (default settings): this algorithm is applied to perform primary draft alignments, and has the particular advantage of automatically concatenating multi-fasta input sequences during whole-genome alignments construction.
input file: filtered nucleotide consensus sequences for each sample, one per each amplicon target (which are , in general, influenza CDSs) and another for the whole-genome sequence (i.e., the set of sequence targeted by the amplicon-based NGS shema, which, in general, is the pool of main 8 influenza CDSs). xmfa to fasta conversion is carried out using “convertAlignment.pl” (https://github.com/lskatz/lyve-SET/blob/master/scripts/convertAlignment.pl
MAFFT (https://mafft.cbrc.jp/alignment/software/) (version 7.313; date 15.01.2018)
For nucleotide alignments:
input file: progressiveMAUVE-derived draft alignments (multifasta format), one per each locus and another for the whole-genome sequence
For amino acid alignments:–amino: assume the sequences are in amino acid.
FastTree (http://www.microbesonline.org/fasttree/) (version 2.1.10 Double precision; date 15.01.2018)
Double-precision mode: suitable for resolving very-short branch lengths accurately (FastTreeDbl executable)
-nt: defines the input nucleotide alignment, which is a MAFFT-derived refined alignments (multifasta format). Alignments to be run include one per each locus and another for the whole-genome sequence.
—gtr: defines the Generalized time-reversible (GTR) model of nucleotide evolution (CAT approximation with 20 rate categories)
-boot: defines the number resample (-boot 1000)
Seqret EMBOSS tool (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/) (version 126.96.36.199; date 15.01.2018)
input file: nucleotide alignments in FASTA (.fasta) to be converted in NEXUS (.nex) format
MSAViewer (http://msa.biojs.net/) (latest; date 15.01.2018)
input files: consensus nucleotide alignments for each locus and for the consensus ‘whole-genome’ sequence (upon concatenation of all individual locus); and amino acid alignments for the encoded proteins
Phylocanvas (http://phylocanvas.org/) (version 2.8.1; date 15.01.2018)
input files: phylogenetic tree obtained from each locus-specific nucleotide alignment and from the alignment of the ‘whole-genome’ sequences (upon concatenation of all individual locus)
Metadata visualization tools were built with great contribution from Luís Rita: https://github.com/warcraft12321
Intra-host minor variant detection (and uncovering of putative mixed infections)¶
This module uses mapping data for the set of samples from each user-restricted INSaFLU project and provides a list of minor intra-host single nucleotide variants (iSNVs), i.e., SNV displaying intra-sample frequency between 1- 50%. This output is automatically re-build and cumulatively updated as more samples are added to each INSaFLU project, making continuous data integration completely flexible and scalable. Plots of the proportion of iSNV at frequency at 1-50% (minor iSNVs) and at frequency 50-90% detected for each sample are also provided as mean to a guide the uncovering of putative mixed infections (exemplified in the Figure). INSaFLU flags samples as “putative mixed infections” based on intra-host SNVs if the following cumulative criteria are fulfilled: the ratio of the number of iSNVs at frequency at 1-50% (minor iSNVs) and 50-90% falls within the range 0,5-2,0 and the sum of the number of these two categories of iSNVs exceeds 20. Alternatively, to account for mixed infections involving extremely different viruses (e.g., A/H3N2 and A/H1N1), the flag is also displayed when the the sum of the two categories of iSNVs exceeds 100, regardless of the first criterion.
Freebayes (https://github.com/ekg/freebayes) (version v1.1.0-54-g49413aa; date 15.01.2018)
–min-mapping-quality: excludes read alignments from analysis if they have a mapping quality less than Q (–min-mapping-quality 20)
—min-base-quality: excludes alleles from iSNV analysis if their supporting base quality is less than Q (–min-base-quality 20)
–min-coverage: requires at least 100-fold of coverage to process a site (–min-coverage 100)
—min-alternate-count: require at least 10 reads supporting an alternate allele within a single individual in order to evaluate the position (–min-alternate-count 10)
–min-alternate-fraction: defines the minimum intra-host frequency of the alternate allele to be assumed (–min-alternate-fraction 0.01). This frequency is contingent on the depth of coverage of each processed site since min-alternate-count is set to 10, i.e., the identification of iSNV sites at frequencies of 10%, 2% and 1% is only allowed for sites with depth of coverage of at least 100-fold, 500-fold and 1000-fold, respectively.