
Let's say I want to construct a phylogenetic tree based on orthologous nucleotide sequences; I prefer not to use protein sequences because nucleotide sequences give better resolution at short evolutionary distances. The species in question differ in GC-content.

If we use a straightforward approach such as maximum likelihood with JC69 or any other classical nucleotide model, conserved protein-coding sequences of distant species with similar GC-content will artificially cluster together. This happens because GC-content mainly affects the wobble (third) codon positions, which makes these sequences look similar at the nucleotide level even when the species are not closely related.
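To illustrate why third codon positions carry most of this compositional signal, here is a minimal sketch that computes GC-content at third codon positions (GC3) of an in-frame coding sequence; the sequence is a made-up example:

```python
def gc3(cds):
    """GC fraction at third codon positions of an in-frame CDS."""
    third = cds.upper()[2::3]  # every third base, starting at position 3
    return sum(base in "GC" for base in third) / len(third)

print(gc3("ATGGCGTTTCAG"))  # four codons, third positions G, G, T, G -> 0.75
```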

What are possible ways to overcome this? So far I have considered the following options:

  1. Using protein sequences. This is possible, of course, but we lose a lot of information at short evolutionary distances. It is also not applicable to non-coding sequences.

  2. Recoding. In this approach C and T are combined into a single pyrimidine state Y (in some implementations G and A are likewise combined into a purine state R). This sounds interesting, but, first, we also lose some information here, and the mathematical properties of the resulting process are unclear. As a result, this approach is not widely used. (A short sketch of this, together with option 3, follows this list.)

  3. Excluding the third codon position from the analysis. Again, we lose some short-distance information. Also, not all synonymous substitutions occur at the third codon position, so some bias remains. Not applicable to non-coding sequences.
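For concreteness, here is a minimal sketch of options 2 and 3 (RY recoding, and dropping third codon positions); the sequence is a made-up example:

```python
# Option 2: RY recoding, collapsing purines (A, G) to R and
# pyrimidines (C, T) to Y, which removes the GC/AT distinction.
RY_TABLE = str.maketrans("AGCTagct", "RRYYrryy")

def ry_recode(seq):
    return seq.translate(RY_TABLE)

# Option 3: drop every third codon position of an in-frame CDS,
# keeping only first and second positions.
def drop_third_positions(cds):
    return "".join(base for i, base in enumerate(cds) if i % 3 != 2)

cds = "ATGGCGTTTCAG"             # made-up in-frame example
print(ry_recode(cds))            # RYRRYRYYYYRR
print(drop_third_positions(cds)) # ATGCTTCA
```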

In theory it should be possible to have a model which allows shifts in GC-content. This would be a non-time-reversible Markov process, and, as far as I understand, there are computational difficulties in estimating the likelihood of such models; for example, the likelihood then depends on where the root is placed, so Felsenstein's pulley principle no longer applies.

Iakov Davydov
  • I would just add that I think there's a key assumption in the setup here: "I do not want to use protein sequences to have a better resolution". We can decompose 'better' here: the nucleotide analysis is likely to be more precise but also more biased, the latter for all the reasons you outline. – roblanf May 19 '17 at 07:24
  • In case you might be interested, I tested some of the approaches you mention, plus a few other recoding schemes (http://dx.doi.org/10.6084/m9.figshare.732758), in the following papers: http://arxiv.org/abs/1307.1586 and http://dx.doi.org/10.1093/molbev/msu105 – bli May 19 '17 at 09:05

3 Answers


There are models that take compositional heterogeneity into account in both the maximum likelihood and Bayesian frameworks. Although the substitution process is then not time-reversible, the computations are simplified by assuming that the instantaneous rate matrix can be decomposed into an "equilibrium frequency" vector (which varies across the tree, i.e. is non-homogeneous) and a symmetric, constant exchange-rate matrix.
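To make the decomposition concrete, here is a minimal sketch (all numerical values are made up) of how a single symmetric exchange matrix S, shared across the tree, combines with branch-specific frequency vectors pi to give branch-specific rate matrices, with Q[i, j] = S[i, j] * pi[j] for i != j:

```python
import numpy as np
from scipy.linalg import expm

def rate_matrix(S, pi):
    """Q[i, j] = S[i, j] * pi[j] for i != j; rows sum to zero."""
    Q = S * pi                           # scale column j by pi[j]
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))  # set diagonal so rows sum to zero
    return Q

# One symmetric exchange matrix shared by the whole tree (made-up values)
S = np.array([[0., 1., 2., 1.],
              [1., 0., 1., 2.],
              [2., 1., 0., 1.],
              [1., 2., 1., 0.]])

# Branch-specific compositions, order A, C, G, T (made-up values)
pi_gc_rich = np.array([0.15, 0.35, 0.35, 0.15])
pi_gc_poor = np.array([0.35, 0.15, 0.15, 0.35])

Q_rich = rate_matrix(S, pi_gc_rich)
Q_poor = rate_matrix(S, pi_gc_poor)

# Transition probabilities over a branch of length t = 0.1
P = expm(Q_rich * 0.1)
```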

I think all your suggestions are also valid, and I remember recoding being used successfully to reduce GC-content bias (examples in the references above and here).

Leo Martins

The following 2004 paper describes a way to model compositional changes across the tree, in a Bayesian framework: https://doi.org/10.1080/10635150490445779

A Python package implementing this ("p4"), with improvements added over the years, is available here: https://github.com/pgfoster/p4-phylogenetics

To get started, you may find useful examples here: http://p4.nhm.ac.uk/scripts.html

This has been used in a few large-scale phylogenetic analyses.

bli

The LogDet (paralinear) distance was constructed specifically to overcome GC-content clustering.

At the time it was devised, only a distance-based implementation was available, so it wasn't very powerful. The other answers here show that Bayesian and ML approaches are now available, though these are tightly tied to their model assumptions.
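As a minimal sketch of the idea: the paralinear/LogDet distance is computed from the 4x4 divergence (joint frequency) matrix of two aligned sequences, and the determinant construction cancels out the compositional component, so the distance stays consistent even between sequences with different base compositions:

```python
import numpy as np

def logdet_distance(seq1, seq2):
    """Paralinear/LogDet distance between two aligned DNA sequences:
    d = -1/4 * [ln det(J) - 1/2 * (sum ln f + sum ln g)],
    where J is the joint (divergence) matrix and f, g its marginals."""
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    J = np.zeros((4, 4))
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a in idx and b in idx:        # skip gaps and ambiguity codes
            J[idx[a], idx[b]] += 1
    J /= J.sum()
    f, g = J.sum(axis=1), J.sum(axis=0)  # marginal base frequencies
    return -0.25 * (np.log(np.linalg.det(J))
                    - 0.5 * (np.log(f).sum() + np.log(g).sum()))
```

Note that J must be non-singular for the formula to work, so in practice the alignment needs to be long enough (or pseudocounts added) that all 16 cells are reasonably estimated.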

Original publication here

M__