01.09.2021

CliqueSNV: a tool for SARS-CoV-2 evolution studies

The current pandemic, caused by the SARS-CoV-2 virus, is developing rapidly, to a great extent because this coronavirus is able to evolve quickly. Many other RNA viruses behave in the same way, continuously producing minority haplotypes that may become dominant and more dangerous. Early detection and identification of minority viral haplotypes is essential for a quick medical response. Researchers from Sechenov University’s Centre for Bio- and Chemoinformatics at the Institute of Biodesign and Modelling of Complex Systems, in co-operation with scientists from Georgia State University (Atlanta, GA, USA) and the Centers for Disease Control and Prevention (CDC, USA), have developed an algorithm and software that help to accurately determine the intrahost variability of viral genomes and correct sequencing errors. This work has been published in the journal Nucleic Acids Research (impact factor 16.971).

Minority haplotypes can be detected by next-generation sequencing (NGS); however, sequencing noise reduces the accuracy. Removing sequencing noise in order to detect evolutionarily close haplotypes remains a complicated task. The authors of the paper proposed an algorithm, called CliqueSNV, which extracts pairs of statistically linked mutations from noisy reads. This procedure considerably reduces sequencing noise and facilitates the identification of minority intrahost haplotypes with frequencies below the sequencing error rate.
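The idea of extracting statistically linked mutation pairs can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`binom_tail`, `linked_pairs`), the thresholds, and the simple binomial test are all assumptions chosen for illustration, and the reads are assumed to be already aligned and of equal length. The sketch flags two mutations as linked when they appear together on the same reads far more often than independent sequencing errors would explain.

```python
from itertools import combinations
from math import exp, lgamma, log, log1p


def binom_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), summed in log space."""
    if p >= 1.0:
        return 1.0
    total = 0.0
    for x in range(k, n + 1):
        log_term = (lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
                    + x * log(p) + (n - x) * log1p(-p))
        total += exp(log_term)
    return min(total, 1.0)


def linked_pairs(reads, consensus, alpha=1e-3, min_support=2):
    """Find pairs of mutations that co-occur more often than chance predicts.

    A mutation is a (position, base) tuple that differs from the consensus.
    alpha and min_support are illustrative thresholds, not published values.
    """
    n = len(reads)
    carriers = {}  # mutation -> indices of the reads that carry it
    for idx, read in enumerate(reads):
        for pos, base in enumerate(read):
            if base != consensus[pos]:
                carriers.setdefault((pos, base), set()).add(idx)

    linked = []
    for a, b in combinations(sorted(carriers), 2):
        if a[0] == b[0]:
            continue  # two alleles at the same position cannot share a read
        both = len(carriers[a] & carriers[b])
        if both < min_support:
            continue
        # Under independence, a read carries both with probability f_a * f_b.
        p_both = (len(carriers[a]) / n) * (len(carriers[b]) / n)
        if binom_tail(both, n, p_both) < alpha:
            linked.append((a, b))
    return linked


# Toy data: 100 aligned reads of length 10 (positions are 0-based).
consensus = "AAAAAAAAAA"
reads = (["AAAAAAAAAA"] * 89    # reads matching the consensus
         + ["AAGAAAAGAA"] * 5   # rare variant with two co-occurring mutations
         + ["AAAACAAAAA"] * 3   # isolated mismatches, as random errors might produce
         + ["AAAAAAAATA"] * 3)
print(linked_pairs(reads, consensus))
```

On this toy input, only the pair at positions 2 and 7 is reported. The isolated mismatches never share a read, so they are discarded as probable noise even though each is almost as frequent as the real variant.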

‘We embarked on this project in 2017’, said Yuri Porozov, Head of Sechenov University’s Centre for Bio- and Chemoinformatics. ‘My former graduate student Sergey Knyazev, who was already working at the Centers for Disease Control and Prevention (CDC) and Georgia State University (Atlanta, GA, USA) under the supervision of Prof Alex Zelikovsky (co-author of the publication), invited my colleague, Tatiana Malygina, and me to participate in the development and benchmarking of a new algorithm for analysing the intrahost variability of viral haplotypes and reliably determining minority haplotypes in noisy sequencing results’.

‘At that time, we could not even imagine that in a couple of years we would have to deal with SARS-CoV-2. This virus is characterised by intrahost variability, and evolution of majority viral haplotypes is a big problem’.

‘The lives of many people, as well as solutions for economy, education, and healthcare depend on understanding the evolution of the coronavirus and its control. This idea has been developed in a recent key paper published in Scientific Reports’.

The CliqueSNV algorithm filters sequencing noise and isolates minority viral haplotypes from noisy data. NGS is one sequencing strategy: genomes are assembled from short reads, often using the ‘shotgun approach’. This task, difficult in itself, becomes much harder in a heterogeneous viral population, where closely related viral variants within a single host must be identified and their frequencies estimated (intrahost variability).

On the other hand, the PacBio single-molecule technique enables the sequencing of the entire viral genome, but it also introduces significant noise, around 13–14%. There are several tools for assembling viral haplotypes, but all of them have limitations, including poor scalability and difficulty identifying haplotypes whose frequency is below the sequencing error rate.

The new algorithm, CliqueSNV, is based on statistical analysis and on cliques, that is, groups of mutations in which every pair is linked. It was tested on open datasets of HIV genomes. The researchers also introduced two new HIV-1 sequencing benchmarks obtained from a patient.
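The role of cliques can be sketched in the same hedged way. Assuming the linked mutation pairs are already known, they can be treated as edges of a graph whose nodes are mutations; a maximal clique is then a set of mutations that are all mutually linked and may belong to one haplotype. The example below uses the textbook Bron–Kerbosch procedure as a stand-in. The function name `maximal_cliques` and the input data are hypothetical, and the published method involves further steps that are not shown here.

```python
def maximal_cliques(edges):
    """Enumerate the maximal cliques of an undirected graph (Bron-Kerbosch)."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    cliques = []

    def expand(current, candidates, excluded):
        # current: clique built so far; candidates: nodes that can extend it;
        # excluded: nodes already explored, kept to avoid duplicate cliques.
        if not candidates and not excluded:
            cliques.append(current)
            return
        for node in list(candidates):
            expand(current | {node},
                   candidates & adjacency[node],
                   excluded & adjacency[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(frozenset(), set(adjacency), set())
    return cliques


# Hypothetical linked pairs; each mutation is a (position, base) tuple.
pairs = [
    ((2, "G"), (7, "G")),
    ((2, "G"), (9, "T")),
    ((7, "G"), (9, "T")),   # three mutually linked mutations
    ((4, "C"), (5, "A")),   # a separate linked pair
]
for clique in maximal_cliques(pairs):
    print(sorted(clique))
```

Here the three mutually linked mutations form one clique and the separate pair forms another, which suggests two distinct candidate variants. Plain Bron–Kerbosch is adequate for small graphs like this one; larger graphs would normally call for a pivoting variant or a graph library.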

In the tests, CliqueSNV correctly restored up to 87% of the haplotypes of the intrahost viral population, while other software tools could not reconstruct even a single haplotype without errors. Importantly, CliqueSNV also correctly identified minority viral haplotypes whose frequency did not exceed 0.1%, and distinguished between haplotypes that differed by only two nucleotides.

‘It is very pleasing that this project, which we started four years ago, was completed and published at a very appropriate time’, Yuri Porozov concluded. ‘I am sure that our algorithm will help us understand the evolution and epidemiology of SARS-CoV-2, thereby providing more options to suppress and control the coronavirus pandemic’.

The CliqueSNV tool can be found on GitHub.

Read more: Knyazev S, et al. Accurate assembly of minority viral haplotypes from next-generation sequencing through efficient noise reduction. Nucleic Acids Res (2021).