Quality control by full-length deep sequencing

Introduction

In 2013 we changed our quality control (QC) methods for PlasmoGEM vectors, in order to fulfill our goal of only releasing vectors of the highest possible quality.
Initially, PlasmoGEM vectors were tested using a combination of PCR and NotI restriction digest QC steps. More recently, we have started to sequence the full genomic DNA insert of all vectors in large batches using Illumina technology. This lets us compare the complete sequence of each vector with the expected nucleotide sequence and so detect the vast majority of potential problems. We now use this as our primary QC for all new vectors, and new vectors will only be released once they have passed it. Please note that some existing constructs are currently being re-examined with this method and may be removed from our database if they fail the new QC procedure.

Limits of QC by deep sequencing

We sequence our constructs to a high depth, so the results are very reliable, but the technology has limits. For example, we may not be able to map the short reads we obtain to some of the very AT-rich and/or repetitive regions of the genome, which may lead to an incorrect QC failure. We have also noticed cases of known loci with higher-than-usual sequence variability, which may likewise lead to an incorrect QC failure by sequencing.

How we decide on QC pass or failure

Large deletions, as well as SNPs and indels in coding sequences, are the most serious types of quality issues because they may lead to unintended mutations in tagged or neighboring genes, which in turn may lead to an incorrect interpretation of your phenotype.
If we cannot provide a sequence-perfect vector, we will tolerate SNPs and small indels in non-coding regions.

More information about this is summarised in the following sections.

QC pass

The sequence of the homology region has been verified. However, we occasionally find small indels or single base changes in non-coding genomic sequence in the homology arms, and such vectors still pass QC. The majority of these mutations are single base insertions or deletions in long homopolymeric tracts of A/T nucleotides. Many of these will have originated in E. coli, but others may pre-exist in the parasite clone from which the gDNA library was created. Small variations in non-coding regions of low complexity are difficult to avoid entirely, and the vast majority will neither affect vector function nor influence downstream phenotypic analysis of resulting transgenic lines. These constructs pass QC so that they can be used where no sequence-perfect vector is available. Please note that the sequence of the selection cassette in the final vectors is not verified, but we do not expect this to be a problem.
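
To illustrate what "a single base indel in a long homopolymeric tract of A/T nucleotides" means in practice, the following minimal Python sketch locates such tracts in a reference sequence and checks whether a given position falls inside one. It is not part of our QC pipeline. The function names and the minimum tract length of 10 bases are assumptions chosen purely for illustration, since no length threshold is defined above.

```python
import re
from typing import List, Tuple


def at_homopolymer_tracts(seq: str, min_len: int = 10) -> List[Tuple[int, int]]:
    """Return (start, end) spans, 0-based and end-exclusive, of runs of A or of T.

    min_len is a hypothetical cut-off for what counts as a "long" tract.
    """
    pattern = re.compile(r"A{%d,}|T{%d,}" % (min_len, min_len))
    return [(m.start(), m.end()) for m in pattern.finditer(seq.upper())]


def in_tract(pos: int, tracts: List[Tuple[int, int]]) -> bool:
    """True if the 0-based position lies inside any of the given tracts."""
    return any(start <= pos < end for start, end in tracts)


# Toy reference: a 12-base A tract, then a T run that is too short to count.
reference = "GC" + "A" * 12 + "GC" + "T" * 5 + "G"
tracts = at_homopolymer_tracts(reference)
```

With this toy sequence, a variant at position 5 lies inside the A tract, whereas one at position 17 lies in the short T run and would not be treated as a homopolymer-tract variant under the assumed threshold.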

QC failure

A vector fails QC if any one of the following features is detected:

Defective barcode module
The barcode, or one of the primer annealing sites flanking it, has a sequence discrepancy that was introduced during primer synthesis.
Large deletion
The vector carries a deletion in one of its homology arms. Such a deletion could reduce the chance of the vector integrating into the genome by forcing recombination to happen inside of the deletion. Alternatively, the deletion might be introduced into the genome in addition to your intended mutation and lead you to misinterpret your phenotype.
Other fail reasons
The vector has a potential single base exchange or small indel that affects the protein-coding sequence or predicted splice site of any gene within the genomic locus.
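
The pass and fail criteria described above can be summarised as a simple decision rule. The Python sketch below is only an illustration of that rule and not the software we use. The `Finding` class, its field names and the category labels are hypothetical, and the sketch assumes that each sequence discrepancy has already been detected and annotated by an upstream step.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Finding:
    """One discrepancy between a vector and its expected sequence (hypothetical model)."""
    kind: str  # "barcode_defect", "large_deletion", "snp" or "small_indel"
    affects_coding_or_splice: bool = False  # hits protein-coding sequence or a predicted splice site


def qc_verdict(findings: List[Finding]) -> Tuple[str, List[str]]:
    """Return ("pass" or "fail", list of fail reasons) following the rules above."""
    reasons = []
    for f in findings:
        if f.kind == "barcode_defect":
            reasons.append("Defective barcode module")
        elif f.kind == "large_deletion":
            reasons.append("Large deletion")
        elif f.kind in ("snp", "small_indel") and f.affects_coding_or_splice:
            reasons.append("Other fail reasons")
        # SNPs and small indels in non-coding regions are tolerated.
    return ("fail" if reasons else "pass", reasons)


# A single-base indel in a non-coding region still passes QC.
tolerated = qc_verdict([Finding("small_indel")])
# The same kind of change in a coding sequence fails QC.
rejected = qc_verdict([Finding("snp", affects_coding_or_splice=True)])
```

A vector with no findings, or with only non-coding SNPs and small indels, returns a pass. Any barcode defect, large deletion, or coding or splice-site change returns a fail together with the matching reason.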