
Many underlying algorithms of variant calling pipelines were developed for the analysis of variants in the human genome, e.g., to investigate genetic disorders or to study tumor samples. Although the applications in biomedical research and plant sciences differ substantially, plant scientists have largely followed benchmarking studies derived from research on human samples, assuming similar performance. Moreover, many plant genomes pose unique challenges for variant calling, namely high amounts of repetitive sequences, large structural variations, and a broad range of heterozygosity and polyploidy. Therefore, the diversity of plant genomes reveals the necessity of a benchmarking study using plant datasets.

However, no comprehensive benchmarking study of read mapping and variant calling tools for plant genome sequences is described in the literature. Due to substantial differences in the nucleotide composition, a dedicated benchmarking on plant genome sequences is advised. A recent study compared the performance of BWA-MEM, SOAP2, and Bowtie2 with the two variant callers GATK and SAMtools/BCFtools on simulated and real tomato datasets. To expand the sparse knowledge about the performance of other read mapping and variant calling tools on plant data, we set out to perform a systematic comparison.

Due to the availability of excellent genomic resources, we selected the well-established plant model organism Arabidopsis thaliana for our study. Although A. thaliana is not a perfect representative of all plants, its genome shows the characteristically high proportion of AT. Despite limitations in heterochromatic and centromeric regions, many plant repeats are resolved in the high-quality genome sequence of A. thaliana.

Detailed characteristics and algorithms of each read mapper have been described elsewhere.
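The AT proportion mentioned for the A. thaliana genome can be illustrated with a minimal sketch. The function name `at_content` and the decision to ignore ambiguous bases such as N are assumptions made for this illustration and are not taken from the study.

```python
def at_content(sequence: str) -> float:
    """Return the fraction of A and T among the unambiguous bases of a sequence.

    Assumption for this sketch: characters other than A, C, G, and T
    (e.g., N) are ignored rather than counted in the denominator.
    """
    unambiguous = [base for base in sequence.upper() if base in "ACGT"]
    if not unambiguous:
        raise ValueError("sequence contains no unambiguous bases")
    at_bases = sum(1 for base in unambiguous if base in "AT")
    return at_bases / len(unambiguous)


if __name__ == "__main__":
    # Toy sequence for illustration only; not real A. thaliana data.
    print(f"AT content: {at_content('ATTAGCATTN'):.2f}")
```

For a whole genome, the same calculation would be applied across all chromosome sequences of the reference.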
Frequently applied read mappers are Bowtie2, BWA-MEM, CLC Genomics Workbench (Qiagen), GEM3, Novoalign, and SOAP2. While most of these tools are freely available for academic use as command line versions, CLC Genomics Workbench is a proprietary software suite for genomics with a graphical user interface.

The choice of tool and parameters can have a large influence on the outcome of the mapping. Reads originating from PCR duplicates should be removed from the mapping prior to variant calling to improve the reliability of the results. Moreover, the quality of the reference genome sequence plays an important role in the performance of the mapper. Particular challenges are low-complexity sequences, repetitive regions, collapsed copies of sequences, contaminations, and gaps in the reference genome sequence.
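The idea behind removing PCR duplicates can be sketched as follows. This is a simplified illustration, not the procedure used in the study: the `Alignment` record and the `remove_duplicates` function are hypothetical, and the sketch assumes that reads mapped to the same chromosome, start position, and strand are duplicates, of which only the one with the highest mapping quality is kept. Established duplicate-marking tools consider more information, such as the position of the mate read.

```python
from typing import Iterable, List, NamedTuple


class Alignment(NamedTuple):
    """Hypothetical, minimal representation of one mapped read."""
    name: str
    chrom: str
    start: int
    reverse: bool
    mapq: int


def remove_duplicates(alignments: Iterable[Alignment]) -> List[Alignment]:
    """Keep one alignment per (chromosome, start, strand) combination.

    Assumption for this sketch: among putative duplicates, the alignment
    with the highest mapping quality is retained; ties keep the first seen.
    """
    best = {}
    for aln in alignments:
        key = (aln.chrom, aln.start, aln.reverse)
        kept = best.get(key)
        if kept is None or aln.mapq > kept.mapq:
            best[key] = aln
    # Return the retained alignments in coordinate order.
    return sorted(best.values(), key=lambda a: (a.chrom, a.start, a.reverse))


if __name__ == "__main__":
    # Toy records for illustration only.
    reads = [
        Alignment("r1", "Chr1", 100, False, 20),
        Alignment("r2", "Chr1", 100, False, 60),
        Alignment("r3", "Chr1", 100, True, 30),
    ]
    for aln in remove_duplicates(reads):
        print(aln.name, aln.chrom, aln.start, aln.mapq)
```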

As the read mapping determines the quality of the alignment, it is arguably the most important step. Sequence reads are aligned to a suitable, but not necessarily the correct, place in the genome sequence. Often, there is a trade-off between mapping speed and the quality of the resulting alignment. Numerous mappers are available, which utilize different algorithms and criteria to generate alignments.

When looking at different performance metrics, BWA-MEM and Novoalign were the best mappers and GATK returned the best results in the variant calling step.

Since most benchmarking studies of bioinformatics tools for this purpose have been conducted on human datasets, there is a lack of benchmarking studies in plant sciences. In this study, we evaluated the performance of 50 different variant calling pipelines, including five read mappers and ten variant callers, on six real plant datasets of the model organism Arabidopsis thaliana. Sets of variants were evaluated based on various parameters including sensitivity and specificity. We found that all investigated tools are suitable for analysis of NGS data in plant research.
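Sensitivity and specificity of a variant set can be computed against a set of trusted variants, as in the following minimal sketch. The function `evaluate_calls`, the tuple representation of a variant, and the `callable_positions` parameter are assumptions made for illustration; how the study defined its gold standard and its true negatives is not stated in this section. The sketch also approximates each variant as occupying a single position when counting true negatives.

```python
from typing import Dict, Set, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternative allele)
Variant = Tuple[str, int, str, str]


def evaluate_calls(called: Set[Variant], truth: Set[Variant],
                   callable_positions: int) -> Dict[str, float]:
    """Compare called variants with a trusted set.

    Assumption for this sketch: every position without a variant in either
    set counts as a potential true negative, so
    TN = callable_positions - TP - FP - FN.
    """
    tp = len(called & truth)   # variants found in both sets
    fp = len(called - truth)   # called, but not in the trusted set
    fn = len(truth - called)   # in the trusted set, but missed
    tn = callable_positions - tp - fp - fn
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn,
            "sensitivity": sensitivity, "specificity": specificity}


if __name__ == "__main__":
    # Toy variants for illustration only.
    truth = {("Chr1", 10, "A", "T"), ("Chr1", 50, "G", "C"), ("Chr2", 7, "C", "A")}
    called = {("Chr1", 10, "A", "T"), ("Chr1", 50, "G", "C"), ("Chr3", 99, "T", "G")}
    print(evaluate_calls(called, truth, callable_positions=100))
```

Because variants are rare compared with the number of reference positions, specificity computed this way tends to be close to 1, so it is usually read alongside sensitivity and the raw false positive count.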

High-throughput sequencing technologies have rapidly developed during the past years and have become an essential tool in plant sciences. However, the analysis of genomic data remains challenging and relies mostly on the performance of automatic pipelines. Frequently applied pipelines involve the alignment of sequence reads against a reference sequence and the identification of sequence variants.
