In this tutorial you will be guided in using PaML to detect natural selection on protein-coding datasets. For some of the following exercises there might be more than one single solution.

*These exercises were prepared by Maria Anisimova*

GOAL: In this exercise we apply the simplest codon model with constant parameter over tree branches and MSA sites. We run `codeml`

**twice** since we want to compare the log-likelihood of the model when the parameters are optimised on a fixed tree topology (provided with the exercise) and when the branch lengths are re-estimated using the selected substitution model.

We are going to use the following dataset:

- *bglobin.nuc (MSA)*

- *bglobin.tree (Tree)*

**Run 1 – branch lengths fixed**

We create a new control file for `codeml`

using the template provided in PaML folder. Then, we modify the following settings in order to run the analysis according to the exercise requests:

```
seqfile = /path/to/bglobin.nuc * sequence data filename
treefile =/path/to/bglobin.trees * tree structure file name
outfile = /path/to/output_m0_fixed * main result file name
seqtype = 1 * 1:codons; 2:AAs; 3:codons-->AAs
model = 0 * models for codons: 0
* 0:one, 1:b, 2:2 or more dN/dS ratios for branches
NSsites = 0 * 0:one w;1:neutral;2:selection; 3:discrete;4:freqs;
* 5:gamma;6:2gamma;7:beta;8:beta&w;9:betaγ
* 10:beta&gamma+1; 11:beta&normal>1; 12:0&2normal>1;
* 13:3normal>0
fix_blength = 2 * 0: ignore, -1: random, 1: initial, 2: fixed
```

**Run 2 – branch lengths re-estimated**

```
seqfile = /path/to/bglobin.nuc * sequence data filename
treefile =/path/to/bglobin.trees * tree structure file name
outfile = /path/to/output_m0_ML * main result file name
seqtype = 1 * 1:codons; 2:AAs; 3:codons-->AAs
model = 0 * models for codons: 0
* 0:one, 1:b, 2:2 or more dN/dS ratios for branches
NSsites = 0 * 0:one w;1:neutral;2:selection; 3:discrete;4:freqs;
* 5:gamma;6:2gamma;7:beta;8:beta&w;9:betaγ
* 10:beta&gamma+1; 11:beta&normal>1; 12:0&2normal>1;
* 13:3normal>0
fix_blength = 1 * 0: ignore, -1: random, 1: initial, 2: fixed
```

**1. Do you observe the codon frequency bias?**

```
Sums of codon usage counts
------------------------------------------------------------------------------
Phe F TTT 65 | Ser S TCT 29 | Tyr Y TAT 12 | Cys C TGT 22
TTC 68 | TCC 44 | TAC 32 | TGC 6
Leu L TTA 1 | TCA 4 | *** * TAA 0 | *** * TGA 0
TTG 15 | TCG 0 | TAG 0 | Trp W TGG 43
------------------------------------------------------------------------------
Leu L CTT 13 | Pro P CCT 40 | His H CAT 48 | Arg R CGT 2
CTC 54 | CCC 24 | CAC 78 | CGC 11
CTA 6 | CCA 9 | Gln Q CAA 7 | CGA 0
CTG 203 | CCG 4 | CAG 65 | CGG 1
------------------------------------------------------------------------------
Ile I ATT 16 | Thr T ACT 45 | Asn N AAT 49 | Ser S AGT 31
ATC 28 | ACC 46 | AAC 70 | AGC 19
ATA 3 | ACA 4 | Lys K AAA 36 | Arg R AGA 2
Met M ATG 23 | ACG 1 | AAG 163 | AGG 43
------------------------------------------------------------------------------
Val V GTT 62 | Ala A GCT 113 | Asp D GAT 57 | Gly G GGT 49
GTC 51 | GCC 132 | GAC 65 | GGC 109
GTA 4 | GCA 12 | Glu E GAA 40 | GGA 17
GTG 147 | GCG 7 | GAG 81 | GGG 17
------------------------------------------------------------------------------
```

**2. Which position displays the most bias? Why?**

```
Codon position x base (3x4) table, overall
position 1: T:0.13930 C:0.23080 A:0.23652 G:0.39338
position 2: T:0.31005 C:0.20997 A:0.32802 G:0.15196
position 3: T:0.26675 C:0.34191 A:0.05923 G:0.33211
Average T:0.23870 C:0.26089 A:0.20792 G:0.29248
```

**3. What is the ML estimate of the transition-transversion ratio κ?**

fixed branch lengths | re-estimated branch lengths |
---|---|

2.16189 | 2.07049 |

**4. What is the ML estimate of the -ratio?**

fixed branch lengths | re-estimated branch lengths |
---|---|

0.20964 | 0.23685 |

**5. How do you interpret these ML estimates?**

The estimates obtained performing a branch length optimisation are tuned on the codon substitution model selected. This procedure optimises the parameters values together with the topology. However, this does not indicate the correctness of the values inferred since this procedure does not perform tree topology rearrangements. One can also avoid this procedure disabling the optimisation of the branch lengths. Then, the codon substitution model parameters are optimised at best on the topology provided.

GOAL: In this exercise we estimate the parameter following several different strategies:

a. allowing it to change at each branch on the tree topology,

b. fixing it on one single branch (two-categories)

c. fixing it on two branches (three-categories).

For all the following tasks we are going to use the following settings:

```
seqfile = /path/to/lysozymeSmall.nuc * sequence data filename
treefile = /path/to/lysozymeSmall.trees * tree structure file name
seqtype = 1 * 1:codons; 2:AAs; 3:codons-->AAs
fix_blength = 1 * 0: ignore, -1: random, 1: initial, 2: fixed
```

**a. free-ratio model**

In the *codeml.ctl* file we set the following parameters:

```
model = 1 * models for codons:
* 0:one, 1:b, 2:2 or more dN/dS ratios for branches
```

**b. two dN/dS ratios**

In the *codeml.ctl* file we set the following parameters:

```
model = 2 * models for codons:
* 0:one, 1:b, 2:2 or more dN/dS ratios for branches
```

In the tree file *lysozymeSmall.trees* we must specify the branch where we think the parameter differs from the rest of the tree.

PaML (and codeML) accepts only the following label format at branch

level:`'#1'`

where`1`

is an integer greater than 0

```
((Hsa_Human, Hla_gibbon) ,((Cgu/Can_colobus, Pne_langur) '#1', Mmu_rhesus), (Ssc_squirrelM, Cja_marmoset));
```

In this case we labeled the branch leading to the clade [colobus, langur]. We are therefore assuming that only this branch will have a value for that differs from the rest of the tree.

**c. three dN/dS ratios**

In the *codeml.ctl* file we set the following parameters:

```
model = 2 * models for codons:
* 0:one, 1:b, 2:2 or more dN/dS ratios for branches
```

In the tree file *lysozymeSmall.trees* we must specify the branches where we think the parameter differs from the rest of the tree.

PaML (and codeML) accepts only the following label format at branch

level:`'#1'`

where`1`

is an integer greater than 0

```
((Hsa_Human, Hla_gibbon) '#2',((Cgu/Can_colobus, Pne_langur) '#1', Mmu_rhesus), (Ssc_squirrelM, Cja_marmoset));
```

In this case we labeled the branch leading to the clade [colobus, langur] and the branch leading to the human clade [Hsa_Human, Hla_gibbon]. We are therefore assuming that only those branches will

have values for that differ from the rest of the tree.

**1. Are results consistent with those presented in ref. (2)?**

**2. Based on LRTs, what model fits your data the best (among 2-ratio, 3-ratio and free-ratio models)?**

**3. What are the degrees of freedoms for each comparison?**

**4. What can you tell about the evolution of your gene from the ML estimates under this best model?**

GOAL: In this exercise we apply different codon models to a protein-coding dataset in order to evaluate which one fits better our dataset w.r.t. natural selection detection. We evaluate alternative evolutionary hypotheses using nested codon models.

For all the following tasks we are going to use the following settings:

```
seqfile = /path/to/bglobin.nuc * sequence data filename
treefile = /path/to/bglobin.trees * tree structure file name
seqtype = 1 * 1:codons; 2:AAs; 3:codons-->AAs
fix_blength = 1 * 0: ignore, -1: random, 1: initial, 2: fixed
```

**Fitting model M1 to M8 in one single run**

```
model = 0
NSsites = 1 2 3 7 8
fix_omega = 0 * 1: omega or omega_1 fixed, 0: estimate
omega = .4 * initial or fixed omega, for codons or codon-based AAs
```

**Fitting model M8a**

```
NSsites = 8
fix_omega = 1 * 1: omega or omega_1 fixed, 0: estimate
omega = 1 * initial or fixed omega, for codons or codon-based AAs
```

**1. Which models are nested?**

- M3 vs. M0
- M2 vs. M1
- M8 vs. M7
- M8a vs. M8

We compute the LRT statistic between nested hypothesis using the

following relation:If the hypothesis is correct, then it is asymptotically distributed as the distribution with degrees of freedom ( is equal to the difference in the number of parameters in and . On the other hand, I will discard the simplest model in favour of the most complex one, if the LRT value is greater of the value of the distribution with degrees of freedom for the p-value I selected (i.e. p-value = 0.05).

model | degrees of freedom | LRT | Significant (p-value = 0.01) |
---|---|---|---|

M3 vs M0 | 4 | 256.9 | yes, I reject M0 |

M2 vs M1 | 2 | 7.190 | no, I accept M1 |

M8 vs M7 | 2 | 22.178 | yes, I reject M7 |

M8 vs M8a | 1 | 12.261 | yes, I reject M8a |

**2. How many degrees of freedom do you use each time to test for significance of the LRT statistic?**

- M3 vs. M0 : 37-33 = 4 degrees of freedom
- M2 vs. M1 : 36-34 = 2 degrees of freedom
- M8 vs. M7 : 36-34 = 2 degrees of freedom
- M8 vs. M8a: 36-35 = 1 degrees of freedom

**3. Do your tests suggest positive selection?**

We tested for positive selection every time we compare the likelihood of two nested models (i.e. M2 vs M1) where one allows for positive selection (it allows for portion of sites on the MSA), while the other does not ().

M2a vs M1a, M8 vs M7, M8 vs M8a model tests can detect positive selection.

**4. Interpret the ML estimates relevant to selective pressure.**

Codon models can identify positive selection monitoring the parameter (i.e. on the branches or at each site), therefore we can observe the ML estimates of the parameter when inferred under different codon models:

parameter | M0 | M1 | M2 | M3 | M7 | M8 | M8a |
---|---|---|---|---|---|---|---|

0.23685 | 0.12 | 0.12 | 0.02 | 0.001 | 0.002 | 0.004 | |

1.00 | 1.00 | 0.30 | 0.008 | 0.015 | 0.022 | ||

2.94 | 1.69 | 0.03 | 0.038 | 0.050 | |||

0.07 | 0.070 | 0.088 | |||||

0.13 | 0.112 | 0.139 | |||||

0.21 | 0.165 | 0.207 | |||||

0.32 | 0.233 | 0.307 | |||||

0.45 | 0.323 | 0.491 | |||||

0.63 | 0.446 | 1.00 | |||||

0.85 | 0.651 | ||||||

2.081 | |||||||

proportion | 1.00 | 0.77 | 0.76 | 0.39 | 0.10 | 0.09 | 0.11 |

0.23 | 0.20 | 0.54 | 0.10 | 0.09 | 0.11 | ||

0.03 | 0.08 | 0.10 | 0.09 | 0.11 | |||

0.10 | 0.09 | 0.11 | |||||

0.10 | 0.09 | 0.11 | |||||

0.10 | 0.09 | 0.11 | |||||

0.10 | 0.09 | 0.11 | |||||

0.10 | 0.09 | 0.11 | |||||

0.10 | 0.09 | 0.11 | |||||

0.10 | 0.09 | ||||||

0.06 |

**M0: Constant **: The ML estimates inferred using the M0 model provide a rough average value distributed over the MSA sites. M0 does not allow for distinct classes, therefore we will not retrieve any information regards positive selection.

**M1: NearlyNeutral – (2 categories)**: This model allows for two different value classes ( ). We observe that the 77% of sites shows an , while 23% of sites shows an . If this model is selected by LRT, one can then discuss about a nearly neutral selection that acted on the dataset.

**M2: PositiveSelection – (3 categories)**: This model allows for three different classes of (negative selection, neutral selection, positive selection). We observe the following that the majority (76%) of sites are under negative selection, 20% of sites is neutrally selected and only the 3% of sites is positively selected (with an value = 2.94). The LRT test comparing M2 and M1 selected M1 in favour of M2 with a significance cutoff of 1%.

**M3: discrete – (3 categories)**: This model estimates the value without forcing it to belong to the conventional classes (i.e. negative, neutral, positive). The ML estimation estimates 3 distinct averaged values for the \omega parameter from 3 distinct site proportions. As matter of fact, here we retrieved 2/3 of values which indicate negative selection (). The centre of mass of the sites is located in the negative selection region of the value distribution.

**M7: beta – (10 categories)**: This model infers the parameter values using a Beta distribution with parameters p and q. The ML routine optimises these parameters and cuts the distribution in 10 equal classes. We can observe that all the sites of our MSA are equally distributed across the inferred values, without bias. This model does not account for a specific class which can indicate positive selection.

**M8: beta – (11 categories)**: M8 model is a direct extension of M7 model since it accounts for an additional class for positive selected sites. We observe an homogeneous distribution of the sites among the classes describing neutral or negative selection, while a small proportion is shown to be under strict positive selection. From LRT comparison, we select the model M8 over its alternative M7, confirming the presence of a portion of sites under positive selection.

**M8a: beta&w=1 – (11 categories)**: This variation of the M8 model fixes the upper boundary of the parameter to assume (at maximum) a neutral value . Despite we observe an equal distribution of the sites among the classes, the hypothesis assumed by this model is discarded in favour of the M8 model allowing for positive selection.

**5. If LRTs suggest positive selection, which sites are inferred by the Bayesian approach to be under positive selection (models M2 and M8)?**

**M2 – discarded**

```
Naive Empirical Bayes (NEB) analysis
Positively selected sites (*: P>95%; **: P>99%)
(amino acids refer to 1st sequence: human)
Pr(w>1) post mean +- SE for w
7 S 0.913 2.772
50 D 0.889 2.725
67 G 0.835 2.620
123 P 0.872 2.693
Bayes Empirical Bayes (BEB) analysis (Yang, Wong & Nielsen 2005. Mol. Biol. Evol. 22:1107-1118)
Positively selected sites (*: P>95%; **: P>99%)
(amino acids refer to 1st sequence: human)
Pr(w>1) post mean +- SE for w
7 S 0.922 2.929 +- 0.956
48 T 0.542 2.037 +- 1.072
50 D 0.905 2.904 +- 0.991
67 G 0.860 2.798 +- 1.039
123 P 0.883 2.818 +- 0.988
```

**M8 – accepted**

```
Naive Empirical Bayes (NEB) analysis
Positively selected sites (*: P>95%; **: P>99%)
(amino acids refer to 1st sequence: human)
Pr(w>1) post mean +- SE for w
7 S 0.995** 2.074
42 S 0.845 1.856
48 T 0.942 1.997
50 D 0.990** 2.067
54 G 0.781 1.762
67 G 0.988* 2.064
85 T 0.866 1.887
123 P 0.995** 2.075
Bayes Empirical Bayes (BEB) analysis (Yang, Wong & Nielsen 2005. Mol. Biol. Evol. 22:1107-1118)
Positively selected sites (*: P>95%; **: P>99%)
(amino acids refer to 1st sequence: human)
Pr(w>1) post mean +- SE for w
7 S 0.979* 2.442 +- 0.534
42 S 0.539 1.625 +- 0.846
48 T 0.833 2.176 +- 0.753
50 D 0.970* 2.430 +- 0.554
54 G 0.520 1.594 +- 0.870
67 G 0.961* 2.413 +- 0.573
85 T 0.616 1.774 +- 0.860
123 P 0.976* 2.435 +- 0.540
```

**6. Do NEB and BEB agree on the sites inferred?**

Both the methods agree on the sites inferred, however they compute different significance values.

**7. Compare results from the LRT comparing M7 vs M8 and the LRT comparing M8a vs M8. Are they both significant (or both non-significant)? If they are both significant, does the Bayesian approach predict the same sites?**

The LRT statistics selects M8 over M7 and M8 over M8a.

GOAL: We want to test if positive selection is acting on sites and these events happened in specific times in the past on the lineages (episodic selection).

**1. We test the first hypothesis H0 ()**

We prepare the codeml.ctl file in the following way:

```
seqfile = lysozymeSmall.nuc * sequence data filename
treefile = lysozymeSmall.trees * tree structure file name
outfile = output_omega_fixed.txt * main result file name
(snip...)
seqtype = 1 * 1:codons; 2:AAs; 3:codons-->AAs
CodonFreq = 2 * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table
(snip...)
model = 2
* models for codons:
* 0:one, 1:b, 2:2 or more dN/dS ratios for branches
(snip...)
NSsites = 2 * 0:one w;1:neutral;2:selection; 3:discrete;4:freqs;
(snip...)
fix_omega = 1 * 1: omega or omega_1 fixed, 0: estimate
omega = 1 * initial or fixed omega, for codons or codon-based AAs
```

**2. We test the second hypothesis H1 ( estimated from the data)**

We prepare the codeml.ctl file in the following way:

```
seqfile = lysozymeSmall.nuc * sequence data filename
treefile = lysozymeSmall.trees * tree structure file name
outfile = output_omega_estimated.txt * main result file name
(snip...)
seqtype = 1 * 1:codons; 2:AAs; 3:codons-->AAs
CodonFreq = 2 * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table
(snip...)
model = 2
* models for codons:
* 0:one, 1:b, 2:2 or more dN/dS ratios for branches
(snip...)
NSsites = 2 * 0:one w;1:neutral;2:selection; 3:discrete;4:freqs;
(snip...)
fix_omega = 0 * 1: omega or omega_1 fixed, 0: estimate
```

- How many LRTs are significant?
- How many LRTs remain significant after the Bonferroni correction for multiple testing?
- What can you tell about the evolution of your gene from the ML estimates obtained after the multiple LRT procedure?

GOAL:

GOAL:

- Z. Yang, R. Nielsen, N. Goldman, A. Krabbe Pedersen, Codon-Substitution Models for Heterogeneous Selection Pressure at Amino Acid Sites, GENETICS May 1, 2000 vol. 155 no. 1 431-449
- Z Yang, Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution, Molecular Biology and Evolution, Oxford University Press, 1998
- Jianzhi Zhang, Rasmus Nielsen, and Ziheng Yang, Evaluation of an Improved Branch-Site Likelihood Method for Detecting Positive Selection at the Molecular Level, Mol Biol Evol (December 2005) 22(12): 2472-2479 first published online August 17, 2005

doi:10.1093/molbev/msi237 - PaML documentation

Document last updated on 03.02.2016

© 2016 Lorenzo Gatti – Applied Computational Genomic Team (ACGT) @ Institute of Applied Simulations (ZHAW) | Wädenswil | Zürich