Mutation analysis of production strains based on DNA sequencing and bioinformatic evaluation, by Dr Marcus Droege, Dr Dave Brett, Dr Jacqueline Weber-Lehmann, and Dr Kai Wilkens
Microbial production strains have been used for many years to produce vitamins (Roessner et al, 2002), feed additives (Pfefferle et al, 2003), vaccines (Xu et al, 2002), antibiotics (Vanden Boom, 2000) and drug components, among others.
Apart from the quality of the substance produced, the decisive parameter is fermentation efficiency.
To improve it, many production strains currently in use have been optimised for production yield.
Random mutations are generated in the respective strain by exposing it to, eg, UV light or strongly mutagenic agents.
In the next step the desired phenotype is selected, eg, clones that produce considerably higher amounts of a certain vitamin compared to the original strain.
With the development of methods at the molecular genetic level, researchers have also become able to amplify specific genes in production strains or to integrate genes from other bacterial organisms (eg, Fernandez-Gonzalez et al, 1996).
Today the sequencing of complete bacterial genomes has become an established process.
It offers the possibility of differential analysis, ie, the analysis of molecular genetic differences between wild type or reference strains and the strains derived from them.
A comparison of the sequences often opens up completely new insights into additional options for optimising production strains.
It is therefore of great importance to the fermentation industry.
MWG Biotech has many years of experience in the analysis of microbial genomes, gained from sequencing more than 20 complete bacterial genomes.
DNA sequencing of the genome of a production strain.
The sequencing of a microbial genome is based on the Whole Genome Shotgun process.
For this, high molecular weight genomic DNA of the production strain to be analysed is isolated and subsequently submitted to a random fragmentation process.
The average size of the resulting so-called shotgun fragments is adjusted to between approximately 1.5 and 8kb, depending on the sequencing strategy and the organism to be analysed.
After the fragment ends have been blunted by means of T4 polymerase and E coli polymerase I, an overhanging A base is attached to these ends by means of a special high efficiency enzyme.
After size fractioning with agarose gel electrophoresis, the DNA fragments are cloned into a so-called T overhang vector, resulting in the creation of a shotgun library.
Next, the inserts of a large number of clones from this shotgun library are sequenced from both ends in a high throughput process.
Per fragment, approximately 1300 to 1500 bases are read.
The aim of this shotgun sequencing phase is to reach a multiple sequence coverage of the genome.
To achieve a tenfold sequence coverage of a 2 megabase genome, altogether 20 million bases are identified from approximately 14,300 shotgun fragments.
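The arithmetic behind this example can be checked with a few lines of Python. This is only a sketch: the function name and the figure of 1,400 bases per fragment (the midpoint of the 1300 to 1500 range quoted above) are assumptions made for illustration.

```python
import math

def reads_needed(genome_size_bp, coverage, bases_per_fragment):
    """Return (total bases to be read, number of shotgun fragments required)."""
    total_bases = round(genome_size_bp * coverage)
    fragments = math.ceil(total_bases / bases_per_fragment)
    return total_bases, fragments

# 2 megabase genome, tenfold coverage, about 1,400 bases read per fragment (assumed midpoint)
total, fragments = reads_needed(2_000_000, 10, 1400)
print(total, fragments)  # 20000000 14286, ie roughly 14,300 fragments
```

The same function gives the effort for lower coverage, for example `reads_needed(2_000_000, 5, 1400)` for a fivefold project.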
By means of pairwise sequence comparison, overlapping shotgun fragments can be found and bioinformatically assembled into so-called contigs (longer contiguous sequences).
The quality of the resulting sequence information and the number of contigs after assembly depend directly on the amount of data generated for the respective organism.
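The principle of finding overlaps by pairwise comparison can be illustrated with a toy example. The sketch below only looks for exact suffix-prefix matches between two reads; production assemblers additionally tolerate sequencing errors and use base qualities and read-pair information. The function names and the minimum overlap length are assumptions for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b (0 if below min_len)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def merge(a, b, min_len=3):
    """Join two overlapping reads into one contiguous sequence, or return None."""
    n = overlap(a, b, min_len)
    return a + b[n:] if n else None

read_1 = "ACGTTGCA"
read_2 = "TGCAGGTC"
print(overlap(read_1, read_2))  # 4 (shared "TGCA")
print(merge(read_1, read_2))    # ACGTTGCAGGTC
```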
Generally, we sequence up to ninefold sequence coverage, and then close remaining gaps between the individual contigs in a subsequent finishing process.
The multiple coverage of most sequence sections results in an extremely high sequence accuracy of about one error in 80,000 bases.
Such sequence precision is ideally suited to uncovering potentially all existing differences between wild type and production strains in a subsequent bioinformatic analysis.
In a number of research projects MWG Biotech investigated whether a lower sequence coverage of a microbial genome might also produce a meaningful comparative sequence analysis.
The results proved that extensive sequence analysis is already possible with fivefold coverage.
The studies also showed that the result strongly depends on the genomic structure - eg, GC content - of the respective bacterium.
In general, it can be said that organisms with a high GC content of more than 65% (such as actinomycetes) are a little easier to analyse, as subcloning of genomic shotgun fragments rich in GC does not cause problems in E coli.
By contrast, subcloning genomic fragments of some gram positive bacteria, eg, Lactobacillus or Enterococcus species, can be considerably more difficult.
This is generally due to the presence of promoters in these organisms that also work well in E coli.
The promoter structures may interfere with plasmid replication or cause the expression of toxic substances.
As a consequence, complete sequence sections may not be represented in the shotgun libraries.
With the help of special low copy vector systems these effects can be minimised. However, at identical sequence coverage the number of contigs will always be higher than for bacteria with high GC content.
Another problem is the high occurrence of repeats, such as IS elements or transposon sequences, in some species, eg, Lactobacillus.
To avoid errors in contig assembly, the respective sequences have to be masked before assembly.
At the same time this causes a lower sequence coverage with a higher number of gaps.
However, as a rule, it can be assumed that for most organisms a reliable analysis is already possible with fivefold sequence coverage.
Based on many years of experience, MWG Biotech offers custom made vector and sequencing systems for a large variety of bacteria.
The various systems are submitted to a test phase before the required sequence coverage (threefold, fivefold or eightfold) is determined.
This is done in close cooperation with the industrial partner.
Mutation analysis based on sequence comparisons using the MWG Bioinformatics platform.
Three important questions have to be answered by differential analysis of a production strain and the respective reference or wild type strain.
1, Does the production strain contain all annotated genes of the reference strain? If not, are the respective open reading frames completely deleted or are they represented by partial reading frames? The latter generally is hard to evaluate if sequence coverage is low.
2, Does the production strain contain new reading frames not present in the wild type? As the production strain generally is a direct descendant of the reference strain, this usually concerns reading frames from other organisms that were specifically inserted in the course of molecular biological strain development.
3, Which genes or sequence sections contain accidental mutations directly affecting the phenotype, such as increased production of a certain antibiotic?
To clarify this point, each individual position within the respective open reading frame has to be checked for a frame shift, a base exchange (SNP), a base insertion or a base deletion.
Once a mutation has been identified, a subsequent comparison of the corresponding amino acid sequences has to show whether the mutation is a so-called silent mutation or whether it causes an amino acid exchange.
Of special interest in this context are interruptions in reading frames through the introduction of stop codons.
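The distinction between silent mutations, amino acid exchanges and newly introduced stop codons can be sketched in Python as follows. This is a minimal illustration, not MWG Biotech's implementation: it assumes two in-frame ORF sequences of equal length (base exchanges only, since insertions, deletions and frame shifts need an alignment first) and uses the standard codon assignments. The function name and output layout are assumptions.

```python
from itertools import product

# Standard codon assignments, codons enumerated in TCAG order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def classify_substitutions(ref_orf, mut_orf):
    """Compare two equal-length, in-frame ORFs codon by codon."""
    if len(ref_orf) != len(mut_orf) or len(ref_orf) % 3:
        raise ValueError("ORFs must be in frame and of equal length")
    events = []
    for i in range(0, len(ref_orf), 3):
        ref_codon = ref_orf[i:i + 3].upper()
        mut_codon = mut_orf[i:i + 3].upper()
        if ref_codon == mut_codon:
            continue
        ref_aa, mut_aa = CODON_TABLE[ref_codon], CODON_TABLE[mut_codon]
        if ref_aa == mut_aa:
            kind = "silent"
        elif mut_aa == "*":
            kind = "premature stop"
        elif ref_aa == "*":
            kind = "stop codon lost"
        else:
            kind = "amino acid exchange"
        # codon positions are reported 1-based
        events.append((i // 3 + 1, ref_codon, mut_codon, ref_aa, mut_aa, kind))
    return events

reference = "ATGGCTTGGAAATAA"   # M A W K *
production = "ATGGCCTGAAGATAA"  # M A * R *
for event in classify_substitutions(reference, production):
    print(event)
```

In this made-up example codon 2 carries a silent mutation, codon 3 a premature stop and codon 4 an amino acid exchange.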
MWG Biotech's bioinformatic workflow.
In the first step of the differential analysis, the DNA sequences of the open reading frames of the reference strain (if no annotation is available, we perform one first) are compared with the contig sequences of the previously sequenced production strain.
Based on the results an ORF categorisation is performed.
ORF categorisation.
The first group consists of ORFs that are 100% identical at the nucleotide level, in length as well as in sequence.
A second group consists of ORFs that contain essentially the complete reading frame but carry mutations within it.
The third group is made up of ORFs which, due to a large number of mutations or their position, have only partial similarity with the reference ORFs.
The stringency with which these analyses are performed determines the distribution of the individual mutation events in the different categories.
As the differences between production strain and reference strain are not uniform (time and degree, number of mutagenesis steps), stringency parameters are individually discussed for each project at MWG Biotech.
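How stringency shifts ORFs between categories can be illustrated with a small sketch. The record fields, the threshold values and the extra "not found" label are all hypothetical and chosen only for illustration; in practice the parameters are agreed per project, as described above.

```python
from dataclasses import dataclass

@dataclass
class OrfHit:
    orf_id: str
    ref_length: int      # length of the reference ORF in bases
    aligned_length: int  # reference bases covered by the best match in the contigs
    mismatches: int      # base exchanges plus gap positions within the aligned part

def categorise(hit, min_coverage=0.95, min_identity=0.90):
    """Assign an ORF to one of the categories; both thresholds are assumed values."""
    if hit.aligned_length == 0:
        return "not found"
    coverage = hit.aligned_length / hit.ref_length
    identity = 1 - hit.mismatches / hit.aligned_length
    if coverage == 1.0 and hit.mismatches == 0:
        return "identical"
    if coverage >= min_coverage and identity >= min_identity:
        return "complete with mutations"
    return "partial"

hits = [
    OrfHit("orf0001", 900, 900, 0),
    OrfHit("orf0002", 900, 900, 3),
    OrfHit("orf0003", 900, 400, 2),
]
for hit in hits:
    print(hit.orf_id, categorise(hit))
```

Raising `min_identity` to 0.999 moves the second ORF from "complete with mutations" to "partial", which is the effect of stringency on the distribution described above.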
Alignment of genes.
The choice of alignment algorithms is of great importance in this step.
If the wild type ORF aligns globally along the full length of the respective ORF in the production strain, a gapped alignment algorithm (Smith-Waterman) is used.
This allows the accurate placement of SNPs.
If the number of gaps, mismatches or frame shifts increases to a degree where a reasonable global alignment of this kind becomes impossible, BLAST is used instead to find the best possible local alignment.
It is important to recognise here that the alignments need to be made on both the protein and the nucleotide levels.
In a number of cases, through changes at the nucleotide level between the wild type and the production strain, a wild type gene could not be detected in full within the production strain.
On the peptide level, however, the functional proteins could still be detected.
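The gapped alignment step mentioned in this section can be illustrated with a compact Smith-Waterman sketch. It is a teaching example under stated assumptions: a linear gap penalty and arbitrary scores (match 2, mismatch -1, gap -2) chosen for illustration, whereas production tools use optimised implementations and usually affine gap penalties. Smith-Waterman reports the best-scoring local alignment, so for two closely related ORFs that alignment typically spans the full length.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Fill the dynamic programming matrix; scores never drop below zero
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero score is reached
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

wild_type = "ACGTACGT"
production = "ACGTTCGT"  # one base exchange at position 5
score, aligned_wt, aligned_prod = smith_waterman(wild_type, production)
snps = [pos + 1 for pos, (x, y) in enumerate(zip(aligned_wt, aligned_prod)) if x != y]
print(score, snps)  # 13 [5]
```

Positions where the two aligned strings differ mark candidate SNPs, insertions or deletions that are then checked against the quality values.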
Another component of the complete evaluation of a production strain is the analysis of the intergenic regions.
Annotating changes in these regions of a production strain can deliver further important information.
Quality values.
Sequence confidence is a vital issue.
When an interesting change in a gene of the production strain is detected, the researcher must know whether this change is a real event - and therefore of biological relevance - or an artefact resulting from the sequencing or assembly processes.
Here the all-important criterion is the quality of the original sequence.
Sequence fragments or 'reads' in the production strain contigs can be exported as individual base peaks.
Using the so-called Phred programme, the quality of each individual base can be evaluated.
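Phred quality values are defined as Q = -10 log10(P), where P is the estimated probability that a base call is wrong, so Q20 corresponds to one error in 100 bases and Q30 to one in 1,000. The sketch below converts between the two and applies a simple filter to a candidate mutation. The threshold of Q30 and the rule that every covering read must reach it are assumptions for illustration only.

```python
import math

def error_probability(q):
    """Estimated probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)

def phred_quality(p):
    """Phred quality corresponding to an error probability p."""
    return -10 * math.log10(p)

def is_well_supported(qualities, min_q=30):
    """True if every read covering the position reaches min_q (assumed rule)."""
    return len(qualities) > 0 and all(q >= min_q for q in qualities)

print(error_probability(20))           # one error in 100 bases
print(is_well_supported([42, 38, 51])) # True
print(is_well_supported([42, 12, 51])) # False, one read is of poor quality here
```

A change supported only by low-quality base calls is more likely to be a sequencing artefact and would be a candidate for resequencing before any biological interpretation.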
Data display.
Once the analysis has been finished, the data has to be presented in a form that allows access on a number of levels.
Thus, lists of missing or novel genes, together with tables of the mutations found within matched genes, alignments and quality values, have to be available.
The user must be able to view sequences globally, as well as locally in detail.
For graphical presentation MWG Biotech has developed the user friendly Annoviewer software tool that allows the parallel representation of multiple gene features.
Detailed information is available by simply moving the mouse over the individual base.
Conclusion.
Competent genome sequencing and detailed gene annotation, combined with intelligent bioinformatic analysis geared to the individual mutation analysis requirements, supply reliable results for optimising production strains.