## Tutorial 2: Evaluating selection pressure with codon models

We will use codeml program from PAML by Ziheng Yang. Use the command line mode for the tasks below. First, you need to understand which with control file options to use. Next, try to reproduce the same analyses with codeml or PAML-X (the GUI for PAML).
You will need a dataset of homologous protein-coding DNA sequences (starting with the 1st codon position and ending with the 3rd). We will use data from published articles and will regenerate published results:

1. Site-models: Yang, Z., R. Nielsen, N. Goldman, A.-M. K. Pedersen. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155: 431-449.
Data 1: bglobin.nuc Tree 2: bglobin.tree
Data 2: HIVenvSweden.nuc Tree 3: HIVenvSweden.tree
Data 3: adh.nuc Tree 1: adh.tree

2. Branch models: Yang, Z. 1998. Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. Mol. Biol. Evol. 15:568-573.
Data 1: lysozymeSmall.nuc Tree 1: lysozymeSmall.trees

3. Branch-site models: Zhang, J., R. Nielsen, and Z. Yang. 2005. Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level. Molecular Biology and Evolution 22:2472-2479.
We will use the same data as in 2 (also analysed in the paper).

These exercises were prepared by Maria Anisimova

1. The simple codon model with constant ω. Choose a dataset from publication 1 and fit model M0 - the most simple codon model with constant ω over time and sites. Run model M0 twice: first with branch lengths fixed to those in the tree file, and once with branch lengths estimated by ML.
Compare the optimised log-likelihoods for the two runs? In which case is it higher? Why?
Next, study the output file for the run with estimated branch lengths:
Do you observe the codon frequency bias?
Study the statistics of nucleotide usage for different codon positions. Which position displays the most bias? Why?
What is the ML estimate of the transition-transversion ratio κ? What is the ML estimate of the ω-ratio? How do you interpret these ML estimates?

2. Branch models. Use the small lysozyme example (publications 2 and 3) to fit free-ratio model for branches.
Are results consistent with those presented in ref. (2)?
Based on LRTs, what model fits your data the best (among 2-ratio, 3-ratio and free-ratio models)?
What are the degrees of freedoms for each comparison?
What can you tell about the evolution of your gene from the ML estimates under this best model?

3. Site-models. Choose a dataset from publication 1 and fit the following site models to your data:
M1, M2, M3, M7, M8a, and M8 (always estimate branch lengths by ML).
Model M0 was already fitted above (make sure you have the output file).
Note: M8a is model M8 with ω for the discrete category fixed to 1.
Which models are nested?
Perform likelihood ratio tests (LRTs) of nested hypotheses.
How many degrees of freedom do you use each time to test for significance of the LRT statistic?
Do your tests suggest positive selection?
Interpret the ML estimates relevant to selective pressure.
If LRTs suggest positive selection, which sites are inferred by the Bayesian approach to be under positive selection (models M2 and M8)?
Do NEB and BEB agree on the sites inferred?
Compare results from the LRT comparing M7 vs M8 and the LRT comparing M8a vs M8. Are they both significant (or both non-significant)? If they are both significant, does the Bayesian approach predict the same sites?

4. Branch-site models. Use the small lysozyme example (publications 2 and 3) to to fit branch-site models to the same data:
For each branch of the tree (one at a time) perform the LRT comparing model MA (ω is estimated) with model MA (fixed ω = 1).
How many LRTs are significant?
How many LRTs remain significant after the Bonferroni correction for multiple testing?
What can you tell about the evolution of your gene from the ML estimates obtained after the multiple LRT procedure?

5. Effect of the tree topology. For one of the datasets you analysed above: Infer a new phylogeny using a simple codon model (M0+Gamma) using CodonPhyML.
Is the phylogeny different from the one you used?
If it is identical, then do you see differences in branch lengths?
If the codon tree is different, use this tree to redo one of the analyses above. Do you still get the same results?
If the codon tree is the same, create a new tree that is different from this topology. Use this tree to redo one of the analyses above. Do you still get the same results?
How important do you think is the effect of the assumed tree topology for the selection analyses?

6. Feeling adventurous? Try the switching codon model implemented in FitModel (Guindon et al 2004).
Use the same dataset and the tree as in task 4.
Use FitModel to test if sites evolved with constant selective pressure over time or whether episodic selective pressure changes took place.
What models would you fit to perform the LRT that answers your question?
Are the results of this analysis consistent with the analyses by branch-site models in codeml?