
Journal Club: Structural and Functional Constraints in the Evolution of Protein Families

October 14, 2009

The theme of this year’s IGERT EvoDevo symposium is “Current Frontiers of Evolution, Development, and Genomics.” Every Friday, starting this week and ending December 4th, our IGERT group is hosting a journal club discussion about our own research in the broader context of paradigm-shifting publications in Evo/Devo/Geno. This week, I’m leading the discussion about my research in computational methods for ancestral sequence reconstruction in the context of a recent review by Catherine Worth, Sungsam Gong, and Tom L. Blundell titled “Structural and Functional Constraints in the Evolution of Protein Families.” If your campus provides access to the journal Nature Reviews Molecular Cell Biology, the paper can be found here:

Here are my insights into why this paper is fundamentally relevant for anyone working with genetic sequence data in an evolutionary context. . .

Scientific frontiers appear when we integrate analyses from the micro and the macro scale. Examples include how biology is informed by chemistry, chemistry is informed by physics, and classical physics is informed by quantum physics. This trend holds for EvoDevo: we are rapidly arriving at an understanding of evolution that rests increasingly on first principles. To be specific, we are beginning to understand how mutations in protein sequence and structure (at the biophysical scale) have consequences for the function and phenotype of cells, individuals, and species (at the macro scale) [see Dean and Thornton, Nature Reviews 2007].

In order to reveal the evolutionary trajectory of a particular protein structure, we need to examine ancient forms of that protein.  However, the simple acquisition of ancestral molecules can be a major obstacle when we examine evolutionary histories over millions of years because the ancestral forms are typically extinct.  As a computational alternative, we can time travel via statistical inference [see Thornton, Nature Reviews 2004].

I study computational and phylogenetic methods that make it possible for us to probabilistically infer phylogenies and reconstruct ancestral gene sequences.  One of the most important inventions in the history of phylogenetic methods is the use of Markov models to approximate the evolution of gene sequences.  Markov models are used all over the place in information science: to model natural language, radio transmissions, and white noise.  Markov models are used in speech recognition, your email’s spam filter, and global weather prediction.  Google’s core search algorithm is fundamentally just a complex Markov model.

The core idea of the Markov model concerns characters transitioning (i.e. mutating) over time. Suppose we have some character, like a single nucleotide or an amino acid, and it is currently in state X, where X is one of the letters in our nucleotide or amino acid alphabet. Over a time interval of length t, X will mutate to state Y with a probability determined by a matrix of relative substitution rates. This model follows the Markov property: the probability of Y later mutating to state Z over time t2 depends only on Y, and is independent of the prior state X.
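To make the Markov property concrete, here is a minimal sketch in Python. It assumes the Jukes-Cantor (JC69) nucleotide model, which is the simplest possible case because every relative substitution rate is equal; that choice, and the function names `jc69_prob` and `two_step_prob`, are mine for illustration and are not taken from any particular phylogenetics package. The sketch checks that mutating from X to Z through any intermediate Y over t1 and then t2 gives the same probability as a single step over t1 + t2.

```python
import math

ALPHABET = "ACGT"


def jc69_prob(x, y, t):
    """Probability that state x becomes state y over a branch of length t.

    Uses the Jukes-Cantor (JC69) closed form, with t measured in expected
    substitutions per site. All relative substitution rates are equal here,
    which is an illustrative simplification.
    """
    decay = math.exp(-4.0 * t / 3.0)
    if x == y:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay


def two_step_prob(x, z, t1, t2):
    """P(x -> z) over t1 then t2, summing over every intermediate state y."""
    return sum(jc69_prob(x, y, t1) * jc69_prob(y, z, t2) for y in ALPHABET)


# Markov property: chaining two branches equals one branch of the total length,
# because the second step depends only on the intermediate state.
direct = jc69_prob("A", "G", 0.4)
chained = two_step_prob("A", "G", 0.1, 0.3)
print(f"direct={direct:.6f} chained={chained:.6f}")
```

A richer model such as WAG would swap the closed form for a matrix exponential of an unequal rate matrix, but the chaining identity would be the same.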

If we calculate transition probabilities for all branches in a phylogenetic tree, we can thus calculate the likelihood of that tree and infer the maximum a posteriori ancestral protein sequence. In this discussion, I will avoid articulating all the mathematical minutiae of how we calculate probabilities for trees and ancestral sequences; you can learn more by reading this excellent book edited by Olivier Gascuel. Instead, I want to focus on the substitution matrix: it is an approximation of molecular evolution and it makes critical assumptions about evolutionary forces.
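For readers who want a feel for the calculation without the full machinery, here is a toy sketch in Python. It assumes the smallest possible tree (one ancestor with two observed descendants), a single nucleotide site, the equal-rates Jukes-Cantor model, and a uniform prior on the ancestral state. The function names and the branch lengths are hypothetical and chosen only for illustration; real trees need a recursive version of the same sum over every internal node.

```python
import math

ALPHABET = "ACGT"


def jc69_prob(x, y, t):
    """Jukes-Cantor probability of x -> y over branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    if x == y:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay


def root_posterior(leaf_a, t_a, leaf_b, t_b):
    """Site likelihood and posterior over the ancestor of two observed leaves.

    Assumes a uniform prior (0.25) on the ancestral state, which matches the
    stationary distribution of the Jukes-Cantor model.
    """
    joint = {
        r: 0.25 * jc69_prob(r, leaf_a, t_a) * jc69_prob(r, leaf_b, t_b)
        for r in ALPHABET
    }
    likelihood = sum(joint.values())  # probability of the observed leaves
    posterior = {r: joint[r] / likelihood for r in ALPHABET}
    return likelihood, posterior


# Hypothetical data: one descendant shows A on a short branch,
# the other shows G on a long branch.
likelihood, posterior = root_posterior("A", 0.1, "G", 0.5)
map_state = max(posterior, key=posterior.get)  # maximum a posteriori ancestor
print(map_state, round(posterior[map_state], 3))
```

The short branch dominates: the ancestor is inferred to share the state of the descendant that has had less time to change.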

In their simplest form (as a 4×4 nucleotide matrix or 20×20 amino acid matrix), substitution matrices assume that all residues with the same state are in a homogeneous biophysical environment, and are thus exposed to the same mutational forces. For example, the WAG matrix assumes that all glutamic acids (E) can be treated equally, and thus the relative substitution rate for any glutamic acid mutating into aspartic acid (D) is 6.174, while the relative rate of any glutamic acid mutating to cysteine (C) is 0.021. The assumption of structural homogeneity is often invalid; for example, as illustrated in this week’s review by Worth et al., residues buried in the solvent-inaccessible core of a protein tend to be more conserved than residues located on the exterior. This insight implies that we need a second substitution matrix expressing relative mutation rates for residues located in protein cores. As an example, if E stands for an external glutamic acid and E’ stands for a core glutamic acid, we should expect the relative substitution rate for E-to-D to be larger than the relative rate for E’-to-D’.
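One way to picture the idea is as a lookup keyed by structural environment as well as by residue pair. The Python sketch below is only a data-structure illustration: the two WAG values are the ones quoted above, but the environment names and the scaling factors are hypothetical placeholders that I made up, not estimates from any dataset. Real environment-specific tables are estimated separately for each environment rather than derived by scaling a single matrix.

```python
# Relative rates quoted in the text for the environment-agnostic WAG matrix.
WAG_RELATIVE_RATE = {("E", "D"): 6.174, ("E", "C"): 0.021}

# HYPOTHETICAL placeholders: these factors are invented for illustration only.
ENVIRONMENT_SCALE = {"exposed": 1.0, "buried": 0.5}

# One table per environment, built here by crude scaling of the shared matrix.
RATE_TABLES = {
    env: {pair: rate * scale for pair, rate in WAG_RELATIVE_RATE.items()}
    for env, scale in ENVIRONMENT_SCALE.items()
}


def relative_rate(src, dst, environment):
    """Look up the relative substitution rate for src -> dst in an environment."""
    return RATE_TABLES[environment][(src, dst)]


# The same E-to-D substitution now has a different rate depending on
# whether the residue sits on the surface or in the core.
for env in ("exposed", "buried"):
    print(env, relative_rate("E", "D", env))
```

The point of the structure is that the model asks two questions per site (which residue, and which environment) before it picks a rate.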

The article by Worth et al. reviews a large historical body of results concerning protein structure conservation.  The article further describes how we can use environment-specific substitution tables (ESSTs) to explicitly capture information about structural conservation into our Markov model of evolution.  The insights from this paper are fundamental for anyone working with genetic sequence data in an evolutionary context.

Worth CL, Gong S, & Blundell TL (2009). Structural and functional constraints in the evolution of protein families. Nature reviews. Molecular cell biology, 10 (10), 709-20 PMID: 19756040


Evolution 2009: Day 2

June 14, 2009

I saw too many talks today to comprehensively discuss them all.  Here are a few that stand out:

Matt Hahn discussed the correlation (or lack thereof) between protein sequence similarity and protein function similarity. Although we have increasingly complex models of sequence evolution (using Markov models, for example), we know almost nothing about how protein function evolves. Matt raised three questions: (1) How fast does protein function evolve? (2) Can we correlate the rate of evolution for protein function to the rate of evolution for protein sequences? (3) Can we find evidence for differential rates of protein function evolution in different types of protein families? Given the short time constraint (15 minutes!), Matt did not conclusively answer any of these questions, but that’s not necessarily a critique of his lecture. His hypothesis was that the rate of evolution for protein function should be slower in orthologs and faster in paralogs. To test this hypothesis, Matt gathered protein function annotations from the Gene Ontology Consortium and plotted these data against rates of evolution for protein sequences. Surprisingly, Matt observed that (1) orthologs appear to evolve faster in function than paralogs, and (2) there is no relation between the rates of sequence evolution and functional evolution. Both of these results are surprising and difficult to explain. Obviously, Matt’s results depend on the accuracy of the Gene Ontology annotations, which are unlikely to be entirely accurate. I think Matt is asking a set of questions that are critically important, but I don’t think accurate answers will be found until we develop a different method for classifying and measuring protein function.

Paul Hohenlohe discussed RAD sequencing with the Illumina Genome Analyzer II to measure genetic variance (as Fst) in stickleback populations.  (RAD sequencing is introduced by Selker et al., Genetics 2007).  Sticklebacks are ancestrally a saltwater fish with bony armor plates.  Sticklebacks colonize freshwater habitats; colonizing populations lose some — or all — of their armor.  Paul used RAD sequencing with Alaskan stickleback populations, and showed that population structures vary between the saltwater and freshwater populations.  Paul’s analysis of stickleback populations provides a compelling example of how RAD sequencing is a high-throughput method for population genomics.

Joe Felsenstein talked about “phylogenetic geometric morphometrics.” Given homologous extant morphologies with a set of identified (x,y) coordinates, Joe first showed geometric techniques to rotate and translate the extant geometries such that they are “aligned” in a fashion roughly analogous to sequence alignment. Next, given a phylogeny relating the extant morphologies, Joe discussed a model using Brownian motion to infer ancestral forms, i.e., an ancestral set of Cartesian coordinates. I’m not a developmental biologist, so I can’t offer much critique of this method. I’m curious how he plans to deal with missing data, i.e., extant morphologies with (x,y) coordinates that don’t appear in all descendants.

Finally, James Foster talked about “evolutionary computation.” Specifically, any process that demonstrates replication, variation, and selection will necessarily demonstrate evolution. James’ point is that evolution can act on digital artifacts as well as biological ones. He gave several examples of genetic algorithms applied to problems as far-reaching as ML phylogenetic estimation (Zwickl 2006), electronic circuit construction (Koza 1985), and jet engine design (Rechenberg 1966). I totally agree with James’ point that evolutionary computation is useful for solving a wide gamut of problems, but I’m afraid his point fell on many deaf ears at this biologically focused conference.

Okay, that’s it for now.