Bacterial Gene Finding and Glimmer (also Archaeal and viral gene finding)

June 20, 2018 | Author: Anonymous | Category: Science, Biology, Biochemistry, Genetics


Arthur L. Delcher and Steven Salzberg
Center for Bioinformatics and Computational Biology
University of Maryland

Outline
• A (very) brief overview of microbial gene-finding
  – Codon composition methods
  – GeneMark: Markov models
• Glimmer1 & 2
  – Interpolated Markov Model (IMM)
  – Interpolated Context Model (ICM)
• Glimmer3
  – Reducing false positives
  – Improving coding initiation site predictions
  – Running Glimmer3

Step One
• Find open reading frames (ORFs).
  Forward strand: …TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA…
  Reverse strand: …ATCTTTTTACCGAGAAATCTATTTAAAGTACTTTTTATAACT…
  [Figure: stop codons marked on both strands, including a stop codon in a shifted reading frame]
• But ORFs generally overlap …
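A minimal Python sketch of Step One, assuming ORFs are taken as stop-to-stop runs of codons and that TAA, TAG and TGA are the only stop codons. The function names are illustrative and do not come from Glimmer.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def stop_to_stop_orfs(seq, min_codons=1):
    """Yield (frame, start, end) for each stop-free run of codons.

    Coordinates are 0-based and half-open on the given strand; the
    bounding stop codons themselves are excluded.
    """
    seq = seq.upper()
    for frame in range(3):
        run_start = frame
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                if (i - run_start) // 3 >= min_codons:
                    yield frame, run_start, i
                run_start = i + 3
        # A run that is still open at the end of the sequence is reported too.
        end = frame + (len(seq) - frame) // 3 * 3
        if (end - run_start) // 3 >= min_codons:
            yield frame, run_start, end


def six_frame_orfs(seq, min_codons=1):
    """ORFs on both strands. Reverse-strand coordinates refer to the
    reverse-complemented sequence, not to the original one."""
    seq = seq.upper()
    forward = [("+",) + orf for orf in stop_to_stop_orfs(seq, min_codons)]
    reverse = [("-",) + orf
               for orf in stop_to_stop_orfs(reverse_complement(seq), min_codons)]
    return forward + reverse


# The fragment from the slide: frame 0 is bounded by TAG and TGA, and
# frame 1 hits the shifted stop TAG after five codons.
fragment = "TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA"
orfs = six_frame_orfs(fragment)
```

On a real genome `min_codons` would be set much higher, since short stop-free runs occur by chance in every frame.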

[Figure: Campylobacter jejuni RM1221 (30.3% GC). All ORFs on both strands are shown; color indicates reading frame. The longest ORFs are likely to be protein-coding genes. Note the low GC content.]

[Figure: Campylobacter jejuni RM1221 (30.3% GC). Purple lines are the predicted genes; purple ORFs show annotated (“true”) genes.]

[Figure: Campylobacter jejuni RM1221 (30.3% GC) and Mycobacterium smegmatis MC2 (67.4% GC). Note what happens in a high-GC genome. Purple lines show annotated genes.]

The Problem
• Need to decide which ORFs are genes.
  – Then figure out the coding start sites
• Can do homology searches, but that won’t find novel genes
  – Besides, there are errors in the databases
• Generally can assume that there are some known genes to use as a training set.
  – Or just find the obvious ones

Codon Composition
• Find patterns of nucleotides in known coding regions (assumed to be available).
  – Nucleotide distribution at the 3 codon positions
  – Hexamers
  – GC-skew: (G-C)/(G+C) computed in windows of size N
  – Amino-acid composition
• Use these to decide which ORFs are genes.
  – Prefer longer ORFs
  – Must deal with overlaps
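The GC-skew statistic above can be sketched in a few lines of Python. This version assumes non-overlapping windows and returns 0.0 for a window with no G or C; both choices are assumptions made for the illustration.

```python
def gc_skew(seq, window):
    """Return (G-C)/(G+C) for consecutive non-overlapping windows."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # Windows without any G or C have undefined skew; report 0.0.
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


skews = gc_skew("GGGCAAAA", window=4)
```

Plotting these values along a chromosome gives a GC-skew plot.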

Bacterial Replication
[Figures: early replication and theta structure; termination of replication; E. coli; B. subtilis; GC-skew plot of Borrelia burgdorferi (Lyme disease pathogen)]

Codon Composition
Nucleotide variation at codon position:

Mycobacterium smegmatis
       Codon position
        1      2      3
  a    19%    23%     6%
  c    27%    28%    48%
  g    42%    20%    39%
  t    12%    28%     7%

Campylobacter jejuni
       Codon position
        1      2      3
  a    36%    36%    36%
  c    13%    17%     9%
  g    30%    14%    10%
  t    21%    33%    44%
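Tables like these can be produced by counting bases at each codon position in a set of known genes. A minimal sketch, assuming each input string is an in-frame coding sequence; the function name is illustrative.

```python
from collections import Counter


def codon_position_freqs(genes):
    """Return three dicts, one per codon position, mapping each base
    to its relative frequency at that position."""
    counts = [Counter(), Counter(), Counter()]
    for gene in genes:
        gene = gene.upper()
        usable = len(gene) - len(gene) % 3  # ignore a trailing partial codon
        for i in range(usable):
            counts[i % 3][gene[i]] += 1
    freqs = []
    for position_counts in counts:
        total = sum(position_counts[b] for b in "ACGT")
        freqs.append({b: position_counts[b] / total if total else 0.0
                      for b in "ACGT"})
    return freqs


freqs = codon_position_freqs(["ATGGCC"])
```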

Codon-Composition Gene Finders
• ZCURVE
  – Guo, Ou & Zhang, NAR 31, 2003
  – Based on nucleotide and di-nucleotide frequency in codons
  – Uses Z-transform and Fisher linear discriminant
• MED
  – Ouyang, Zhu, Wang & She, JBCB 2(2) 2004
  – Based on amino-acid frequencies
  – Uses nearest-neighbor classification on entropies

Probabilistic Methods
• Create models that have a probability of generating any given sequence.
• Train the models using examples of the types of sequences to generate.
• The “score” of an ORF is the probability of the model generating it.
  – Can also use a negative model (i.e., a model of non-ORFs) and make the score be the ratio of the probabilities (i.e., the odds) of the two models.
  – Use logs to avoid underflow
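A small Python illustration of the last two points, using per-base (zero-order) models purely to keep the example short. The probability values are hypothetical and not taken from any genome.

```python
import math


def log_odds(seq, coding, background):
    """Sum of per-base log ratios: log P(seq | coding) - log P(seq | background)."""
    return sum(math.log(coding[b] / background[b]) for b in seq)


# Hypothetical base probabilities for the positive and negative models.
coding = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}
background = {b: 0.25 for b in "ACGT"}

long_orf = "GC" * 1000

# Multiplying 2000 probabilities directly underflows to 0.0 in double precision.
direct_product = math.prod(coding[b] for b in long_orf)

# The log-odds score stays finite and is positive when the coding model fits better.
score = log_odds(long_orf, coding, background)
```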

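Fixed-Order Markov Models
• A kth-order Markov model bases the probability of an event on the preceding k events.
• Example: With a 3rd-order model the probability of the sequence CTAGAT would be:

$$P(\mathrm{G} \mid \mathrm{CTA}) \cdot P(\mathrm{A} \mid \mathrm{TAG}) \cdot P(\mathrm{T} \mid \mathrm{AGA})$$

  In each term the base before the bar is the target and the three bases after it are the context.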

Fixed-Order Markov Models
• Advantages:
  – Easy to train. Count frequencies of (k+1)-mers in training data.
  – Easy to compute probability of sequence.
• Disadvantages:
  – Many (k+1)-mers may be undersampled in training data.
  – Models data as fixed-length chunks.
[Figure: a fixed-length context followed by its target base, marked on …ACGTAGTTCAGTA…]
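Training by counting (k+1)-mers can be sketched as follows. The add-one pseudocount is an assumption added here so that unseen contexts do not produce zero probabilities; it is one simple way to soften the undersampling problem, not something the slides prescribe. Input is assumed to contain only A, C, G and T.

```python
import math
from collections import defaultdict


def train_markov(seqs, k, pseudocount=1.0):
    """Count (k+1)-mers and return a function prob(context, base)."""
    counts = defaultdict(lambda: dict.fromkeys("ACGT", 0))
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1

    def prob(context, base):
        # .get avoids creating entries for contexts never seen in training.
        c = counts.get(context, dict.fromkeys("ACGT", 0))
        return (c[base] + pseudocount) / (sum(c.values()) + 4 * pseudocount)

    return prob


def log_prob(seq, k, prob):
    """Log probability of seq given its first k bases."""
    return sum(math.log(prob(seq[i - k:i], seq[i])) for i in range(k, len(seq)))


prob = train_markov(["CTAGCTAGCTAG"], k=3)
# Three terms, as in the slide: P(G|CTA), P(A|TAG), P(T|AGA).
lp = log_prob("CTAGAT", 3, prob)
```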

GeneMark
• Borodovsky & McIninch, Comp. Chem 17, 1993.
• Uses 5th-order Markov model.
• Model is 3-periodic, i.e., a separate model for each nucleotide position in the codon.
• DNA region gets 7 scores: 6 reading frames & non-coding; the highest score wins.
• Lukashin & Borodovsky, Nucl. Acids Res. 26, 1998 is the HMM version.

Interpolated Markov Models (IMM)
• Introduced in Glimmer 1.0: Salzberg, Delcher, Kasif & White, NAR 26, 1998.
• Probability of the target position depends on a variable number of previous positions (sometimes 2 bases, sometimes 3, 4, etc.)
• How many is determined by the specific context.
• E.g., for context ggtta the next position might depend on the previous 3 bases, tta. But for context catta all 5 bases might be used.

Real IMMs
• Model has additional probabilities, λ, that determine which parts of the context to use.
• E.g., the probability of g occurring after context atca is:

$$\lambda(atca)\,P(g \mid atca) + (1-\lambda(atca))\Bigl[\lambda(tca)\,P(g \mid tca) + (1-\lambda(tca))\bigl[\lambda(ca)\,P(g \mid ca) + (1-\lambda(ca))\,[\lambda(a)\,P(g \mid a) + (1-\lambda(a))\,P(g)]\bigr]\Bigr]$$

Real IMMs
• Result is a linear combination of different Markov orders:

$$b_4\,P(g \mid atca) + b_3\,P(g \mid tca) + b_2\,P(g \mid ca) + b_1\,P(g \mid a) + b_0\,P(g)$$

  where $b_0 + b_1 + b_2 + b_3 + b_4 = 1$
• Can view this as interpolating the results of different-order models.
• The probability of a sequence is still the product of the probabilities of the bases in the sequence.
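The nested λ form and the linear-combination form can be checked against each other with a short Python sketch. The dictionaries `lam` and `prob` and all numbers in them are hypothetical inputs chosen for the illustration.

```python
import math


def imm_prob(base, context, lam, prob):
    """Nested form: blend the longest context with the next shorter one.

    lam maps a context to its λ; prob maps (context, base) to P(base | context),
    with the empty string as the zero-order context.
    """
    if context == "":
        return prob[("", base)]
    l = lam[context]
    return l * prob[(context, base)] + (1 - l) * imm_prob(base, context[1:], lam, prob)


def interpolation_weights(context, lam):
    """Equivalent linear weights, longest order first; they sum to 1."""
    weights = []
    remaining = 1.0
    while context:
        weights.append(remaining * lam[context])
        remaining *= 1 - lam[context]
        context = context[1:]
    weights.append(remaining)  # weight of the zero-order term
    return weights


# Hypothetical values for a two-base context.
lam = {"ca": 0.5, "a": 0.8}
prob = {("ca", "g"): 0.4, ("a", "g"): 0.3, ("", "g"): 0.25}

nested = imm_prob("g", "ca", lam, prob)
b2, b1, b0 = interpolation_weights("ca", lam)
linear = b2 * prob[("ca", "g")] + b1 * prob[("a", "g")] + b0 * prob[("", "g")]
```

Both forms give the same value, and the weights sum to 1 for any choice of λ values in [0, 1].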

Real IMMs
• Problem: How to determine the λ’s (or equivalently the b_j’s)?
• Traditionally done with EM algorithm using cross-validation (deleted estimation).
  – Slow
  – Hard to understand results
  – Overtraining can be a problem
• We will cover EM later as part of HMMs

Glimmer IMM
• Glimmer assumes:
  – Longer context is always better
  – Only reason not to use it is undersampling in training data.
• If sequence occurs frequently enough in training data, use it, i.e., λ = 1
• Otherwise, use frequency and χ² significance to set λ.
• Interpolation is always between only 2 adjacent model lengths.
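One possible shape for such a rule, as a Python sketch only. The parameters `threshold` and `min_confidence`, the way the χ² confidence is supplied by the caller, and the linear scaling by count are all assumptions of this illustration; the slides do not give the exact rule here.

```python
def glimmer_style_lambda(count, chi2_confidence, threshold, min_confidence):
    """Hypothetical λ rule following the bullets above.

    count            number of times the context occurs in the training data
    chi2_confidence  confidence (0..1) from a χ² test that the longer context
                     predicts differently from the shorter one; computing it
                     is left to the caller
    threshold        count at which the context is trusted fully (assumed)
    min_confidence   below this confidence the longer context is ignored (assumed)
    """
    if count >= threshold:
        return 1.0                      # frequent enough: use it outright
    if chi2_confidence < min_confidence:
        return 0.0                      # no evidence the longer context helps
    return chi2_confidence * count / threshold


# Arbitrary illustrative arguments.
full = glimmer_style_lambda(150, 0.0, threshold=100, min_confidence=0.5)
none = glimmer_style_lambda(20, 0.3, threshold=100, min_confidence=0.5)
partial = glimmer_style_lambda(50, 0.8, threshold=100, min_confidence=0.5)
```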

More Precisely
• Suppose context of length k+1 occurs a times, a …

… tp.train
#3 Build the icm from the training sequences
build-icm -r tp.icm < tp.train
#4 Run first Glimmer
glimmer3 -o50 -g110 -t30 tpall.1con tp.icm tp.run1
#5 Get training coordinates from first predictions
tail -n +2 tp.run1.predict > tp.coords
#6 Create a position weight matrix (PWM) from the regions
#  upstream of the start locations in tp.coords
upstream-coords.awk 25 0 tp.coords | extract tpall.1con - > tp.upstream
elph tp.upstream LEN=6 | get-motif-counts.awk > tp.motif
#7 Determine the distribution of start-codon usage in tp.coords
set startuse = `start-codon-distrib --3comma tpall.1con tp.coords`
#8 Run second Glimmer
glimmer3 -o50 -g110 -t30 -b tp.motif -P $startuse tpall.1con tp.icm tp

A novel application of Glimmer
• P. didemni is a photosynthetic microbe that lives as an endosymbiont in the sea squirt L. patella
• P. didemni can only be cultured in L. patella cells
[Figure: L. patella (sea squirt)]

A novel application of Glimmer
• Generated 82,337 shotgun reads
• Bacterial genome 5 Mb
• Host genome estimated at 160 Mb
• Depth of coverage therefore much greater for bacterial contigs
• Singleton reads primarily belong to host

A novel application of Glimmer
• Create training sets by classifying reads from scaffolds > 10 kb as bacterial
  – 36,920 reads
• Reads where both read and mate were singletons were treated as sea squirt
  – 21,276 reads
• 21,141 reads unclassified

A novel application of Glimmer
• Train a non-periodic IMM on both sets of data
• 2 IMMs created
• Then classify reads using the ratio of scores from the two IMMs
• In a 5-fold cross-validation, classification accuracy was
  – 98.9% on P. didemni reads
  – 99.9% on L. patella reads
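The classification step can be sketched in Python as a comparison of two model scores. To keep the example short, each model here is a simple base-composition model standing in for an IMM; that substitution, the function names and the toy reads are all assumptions of this sketch.

```python
import math
from collections import Counter


def train_composition_model(reads, pseudocount=1.0):
    """Stand-in for an IMM: log base frequencies with add-one smoothing."""
    counts = Counter()
    for read in reads:
        counts.update(b for b in read.upper() if b in "ACGT")
    total = sum(counts[b] for b in "ACGT") + 4 * pseudocount
    return {b: math.log((counts[b] + pseudocount) / total) for b in "ACGT"}


def log_score(read, model):
    """Log probability of the read under the model (ambiguous bases skipped)."""
    return sum(model[b] for b in read.upper() if b in model)


def classify(read, bacterial_model, host_model):
    """Label a read by the sign of the log ratio of the two model scores."""
    log_ratio = log_score(read, bacterial_model) - log_score(read, host_model)
    return "bacterial" if log_ratio > 0 else "host"


# Toy training sets, not real data.
bacterial_model = train_composition_model(["GCGCGGCC", "GGCCGCGC"])
host_model = train_composition_model(["ATATTAAT", "TTAATATA"])

label_1 = classify("GCGGCC", bacterial_model, host_model)
label_2 = classify("ATTATA", bacterial_model, host_model)
```

Swapping in a higher-order or interpolated model only changes how `log_score` is computed; the decision rule stays the same.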

A novel application of Glimmer
• Finally, re-assemble using ONLY bacterial reads
• Original assembly:
  – 65 scaffolds of 20 Kb or longer
  – total scaffold length 5.74 Mb
• Improved assembly:
  – 58 scaffolds of 20 Kb or longer
  – total scaffold length 5.84 Mb

Acknowledgements
Art Delcher, Steven Salzberg, Owen White (TIGR), Simon Kasif (Boston U.), Doug Harmon (Loyola College), Kirsten Bratke, Edwin Powers (Johns Hopkins U.), Dan Haft (TIGR), Bill Nelson (TIGR)
