
Let's say I want to construct a phylogenetic tree based on orthologous nucleotide sequences; I prefer not to use protein sequences because nucleotide sequences give better resolution at short evolutionary distances. The species in question differ in GC-content.

If we use a straightforward approach such as maximum likelihood with JC69 or any other classical nucleotide model, conserved protein-coding sequences of distant species with similar GC-content will artificially cluster together. This happens because GC-content mainly affects the wobble (third) codon positions, which makes these sequences look similar at the nucleotide level even when the species are not closely related.
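To illustrate why third codon positions carry most of this compositional signal, here is a minimal sketch that computes GC-content at third codon positions (GC3) of an in-frame coding sequence; the sequence is a made-up example:

```python
def gc3(cds):
    """GC fraction at third codon positions of an in-frame CDS."""
    third = cds.upper()[2::3]  # every third base, starting at position 3
    return sum(base in "GC" for base in third) / len(third)

print(gc3("ATGGCGTTTCAG"))  # four codons, third positions G, G, T, G -> 0.75
```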

What are possible ways to overcome this? So far I have considered the following options:

  1. Using protein sequences. This is possible, of course, but we lose a lot of information at short evolutionary distances. It is also not applicable to non-coding sequences.

  2. Recoding. In this approach C and T are combined into a single pyrimidine state Y (in some implementations G and A are likewise combined into a purine state R). This sounds interesting, but, first, we also lose some information here, and the mathematical properties of the resulting process are unclear. As a result, this approach is not widely used. (A short sketch of this, together with option 3, follows this list.)

  3. Excluding the third codon position from the analysis. Again, we lose some short-distance information. Also, not all synonymous substitutions occur at the third codon position, so some bias remains. Not applicable to non-coding sequences.
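For concreteness, here is a minimal sketch of options 2 and 3 (RY recoding, and dropping third codon positions); the sequence is a made-up example:

```python
# Option 2: RY recoding, collapsing purines (A, G) to R and
# pyrimidines (C, T) to Y, which removes the GC/AT distinction.
RY_TABLE = str.maketrans("AGCTagct", "RRYYrryy")

def ry_recode(seq):
    return seq.translate(RY_TABLE)

# Option 3: drop every third codon position of an in-frame CDS,
# keeping only first and second positions.
def drop_third_positions(cds):
    return "".join(base for i, base in enumerate(cds) if i % 3 != 2)

cds = "ATGGCGTTTCAG"             # made-up in-frame example
print(ry_recode(cds))            # RYRRYRYYYYRR
print(drop_third_positions(cds)) # ATGCTTCA
```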

In theory it should be possible to have a model which allows shifts in GC-content. This would be a non-time-reversible Markov process, and, as far as I understand, there are computational difficulties in estimating the likelihood of such models; for example, the likelihood then depends on where the root is placed, so Felsenstein's pulley principle no longer applies.

Iakov Davydov
  • I would just add that I think there's a key assumption in the setup here: "I do not want to use protein sequences to have a better resolution". We can decompose 'better' here: the nucleotide analysis is likely to be more precise but also more biased, the latter for all the reasons you outline. – roblanf May 19 '17 at 07:24
  • In case you might be interested, I tested some of the approaches you mention, plus a few other recoding schemes (http://dx.doi.org/10.6084/m9.figshare.732758), in the following papers: http://arxiv.org/abs/1307.1586 and http://dx.doi.org/10.1093/molbev/msu105 – bli May 19 '17 at 09:05

3 Answers


There are models that take compositional heterogeneity into account in both the maximum likelihood and Bayesian frameworks. Although the substitution process is then not time-reversible, the computations are simplified by assuming that the instantaneous rate matrix can be decomposed into an "equilibrium frequency" vector (which varies across the tree, i.e. is non-homogeneous) and a symmetric, constant exchange-rate matrix.
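To make the decomposition concrete, here is a minimal sketch (all numerical values are made up) of how a single symmetric exchange matrix S, shared across the tree, combines with branch-specific frequency vectors pi to give branch-specific rate matrices, with Q[i, j] = S[i, j] * pi[j] for i != j:

```python
import numpy as np
from scipy.linalg import expm

def rate_matrix(S, pi):
    """Q[i, j] = S[i, j] * pi[j] for i != j; rows sum to zero."""
    Q = S * pi                           # scale column j by pi[j]
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))  # set diagonal so rows sum to zero
    return Q

# One symmetric exchange matrix shared by the whole tree (made-up values)
S = np.array([[0., 1., 2., 1.],
              [1., 0., 1., 2.],
              [2., 1., 0., 1.],
              [1., 2., 1., 0.]])

# Branch-specific compositions, order A, C, G, T (made-up values)
pi_gc_rich = np.array([0.15, 0.35, 0.35, 0.15])
pi_gc_poor = np.array([0.35, 0.15, 0.15, 0.35])

Q_rich = rate_matrix(S, pi_gc_rich)
Q_poor = rate_matrix(S, pi_gc_poor)

# Transition probabilities over a branch of length t = 0.1
P = expm(Q_rich * 0.1)
```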

I think all your suggestions are also valid, and I remember recoding being used successfully to reduce GC-content bias (examples in the references above and here).

Leo Martins

The following 2004 paper describes a way to model compositional changes across the tree, in a Bayesian framework: https://doi.org/10.1080/10635150490445779

A Python package implementing this ("p4"), with improvements added over the years, is available here: https://github.com/pgfoster/p4-phylogenetics

To get started, you may find useful examples here: http://p4.nhm.ac.uk/scripts.html

This has been used in a few large-scale phylogenetic analyses.

bli

The LogDet (paralinear) distance was constructed specifically to overcome GC-content clustering.

At the time it was devised, only a distance-based implementation was available, so it wasn't very powerful. The other answers here show that Bayesian and ML approaches are now available, though these are tightly tied to their model assumptions.
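As a minimal sketch of the idea: the paralinear/LogDet distance is computed from the 4x4 divergence (joint frequency) matrix of two aligned sequences, and the determinant construction cancels out the compositional component, so the distance stays consistent even between sequences with different base compositions:

```python
import numpy as np

def logdet_distance(seq1, seq2):
    """Paralinear/LogDet distance between two aligned DNA sequences:
    d = -1/4 * [ln det(J) - 1/2 * (sum ln f + sum ln g)],
    where J is the joint (divergence) matrix and f, g its marginals."""
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    J = np.zeros((4, 4))
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a in idx and b in idx:        # skip gaps and ambiguity codes
            J[idx[a], idx[b]] += 1
    J /= J.sum()
    f, g = J.sum(axis=1), J.sum(axis=0)  # marginal base frequencies
    return -0.25 * (np.log(np.linalg.det(J))
                    - 0.5 * (np.log(f).sum() + np.log(g).sum()))
```

Note that J must be non-singular for the formula to work, so in practice the alignment needs to be long enough (or pseudocounts added) that all 16 cells are reasonably estimated.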

Original publication here

M__