Posts Tagged ‘academia’

Journal Club: Structural and Functional Constraints in the Evolution of Protein Families

October 14, 2009

The theme of this year’s IGERT EvoDevo symposium is “Current Frontiers of Evolution, Development, and Genomics.”  Every Friday, starting this week and ending December 4th, our IGERT group is hosting a journal club discussion about our own research in the broader context of paradigm-shifting publications in Evo/Devo/Geno.  This week, I’m leading the discussion about my research in computational methods for ancestral sequence reconstruction in the context of a recent review by Catherine Worth, Sungsam Gong, and Tom L. Blundell titled “Structural and Functional Constraints in the Evolution of Protein Families.” If your campus provides access to the journal Nature Reviews, the paper can be found here:

Here are my insights into why this paper is fundamentally relevant for anyone working with genetic sequence data in an evolutionary context. . .

Scientific frontiers appear when we integrate analyses from the micro and the macro scale. Examples of this include how biology is informed by chemistry, chemistry is informed by physics, and classical physics is informed by quantum physics.  This trend holds for EvoDevo: we are rapidly arriving at an understanding of evolution from first principles.  To be specific, we are beginning to understand how mutations in protein sequence and structure (at the biophysical scale) have consequences for the function and phenotype of cells, individuals, and species (at the macro scale) [see Dean and Thornton, Nature Reviews 2007].

In order to reveal the evolutionary trajectory of a particular protein structure, we need to examine ancient forms of that protein.  However, the simple acquisition of ancestral molecules can be a major obstacle when we examine evolutionary histories over millions of years because the ancestral forms are typically extinct.  As a computational alternative, we can time travel via statistical inference [see Thornton, Nature Reviews 2004].

I study computational and phylogenetic methods that make it possible for us to probabilistically infer phylogenies and reconstruct ancestral gene sequences.  One of the most important inventions in the history of phylogenetic methods is the use of Markov models to approximate the evolution of gene sequences.  Markov models are used all over the place in information science: to model natural language, radio transmissions, and white noise.  Markov models are used in speech recognition, your email’s spam filter, and global weather prediction.  Google’s core search algorithm is fundamentally just a complex Markov model.

The core idea of the Markov model concerns characters transitioning (i.e. mutating) over time.  Suppose we have some character, like a single nucleotide or an amino acid, and it is currently in state X, where X is one of the letters in our nucleotide or amino acid alphabet.  Over a time interval of length t, X will mutate to state Y with a probability determined by a matrix of relative substitution rates.  This model follows the Markov property: the probability of Y later mutating to state Z over time t2 depends only on Y, and is independent of the prior state X.
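To make the Markov property concrete, here is a minimal Python sketch.  It assumes the Jukes-Cantor (JC69) nucleotide model, which is my choice for illustration because its transition probabilities have a simple closed form; the function names are hypothetical.  It builds the 4×4 transition matrix for a branch of length t (in expected substitutions per site) and checks that evolving for 0.1 and then 0.2 gives the same probabilities as evolving for 0.3 in one step.

```python
import math

NUCLEOTIDES = "ACGT"

def jc69_transition_matrix(t):
    """Return P(t) under Jukes-Cantor; t is in expected substitutions per site."""
    decay = math.exp(-4.0 * t / 3.0)
    same = 0.25 + 0.75 * decay   # probability of ending in the starting state
    diff = 0.25 - 0.25 * decay   # probability of ending in one specific other state
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def matmul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

# One long branch versus the same branch split into two consecutive pieces.
p_long = jc69_transition_matrix(0.3)
p_composed = matmul(jc69_transition_matrix(0.1), jc69_transition_matrix(0.2))

for i, x in enumerate(NUCLEOTIDES):
    for j, y in enumerate(NUCLEOTIDES):
        print(f"P({x}->{y}): direct={p_long[i][j]:.6f} composed={p_composed[i][j]:.6f}")
```

The two columns agree because the second step only needs to know the intermediate state, not how the character got there.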

If we calculate transition probabilities for all branches in a phylogenetic tree, we can thus calculate the likelihood of that tree and infer the maximum a posteriori ancestral protein sequence.  In this discussion, I will avoid articulating all the mathematical minutiae of how we calculate probabilities for trees and ancestral sequences; you can learn more by reading this excellent book edited by Oliver Gascuel.  Instead, I want to focus on the substitution matrix: it is an approximation of molecular evolution and it makes critical assumptions about evolutionary forces.
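As a rough illustration of the ancestral inference step, here is a small Python sketch for the simplest possible tree: one ancestor with two descendants, at a single site.  It assumes the Jukes-Cantor model with uniform state frequencies (an assumption made for brevity, not the model used in real protein reconstructions), and the function names are hypothetical.

```python
import math

STATES = "ACGT"

def jc69_prob(a, b, t):
    """Probability that state a becomes state b over branch length t (JC69)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def ancestor_posterior(left, t_left, right, t_right):
    """Posterior over the ancestral state at one site, given two descendants."""
    # Joint probability of each candidate ancestor and both observed leaves,
    # using a uniform prior of 1/4 on the ancestral state.
    joint = {a: 0.25 * jc69_prob(a, left, t_left) * jc69_prob(a, right, t_right)
             for a in STATES}
    site_likelihood = sum(joint.values())   # likelihood of this tiny tree
    posterior = {a: p / site_likelihood for a, p in joint.items()}
    return posterior, site_likelihood

# Leaves disagree; the leaf on the shorter branch carries more weight.
posterior, likelihood = ancestor_posterior("A", 0.1, "G", 0.5)
map_state = max(posterior, key=posterior.get)
print(posterior, map_state)
```

Real trees need the same calculation repeated recursively over every internal node, which is the part I am skipping here.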

In their simplest form (as a 4×4 nucleotide matrix or 20×20 amino acid matrix), substitution matrices assume that all residues with the same state are in a homogeneous biophysical environment, and are thus exposed to the same mutational forces.  For example, the WAG matrix assumes that all glutamic acids (E) can be treated equally, and thus the relative substitution rate for any glutamic acid mutating into aspartic acid (D) is 6.174, while the relative rate of any glutamic acid mutating to cysteine (C) is 0.021.  The assumption of structural homogeneity is often invalid; for example, as is illustrated in this week’s review by Worth et al., residues buried in solvent-inaccessible cores of a protein tend to be more conserved than residues located on the exterior of proteins.  This insight implies that we need a secondary substitution matrix expressing relative mutation rates for residues located in protein cores.  As an example, if E stands for an external glutamic acid and E’ stands for a core glutamic acid, we should expect the relative substitution rate for E-to-D to be larger than the relative rate for E’-to-D’.
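One way to picture an environment-aware model is a lookup that takes the structural environment as an extra key.  The Python sketch below is a toy: the exposed-residue rates are the two WAG values quoted above, while the core rates come from a made-up damping factor that I chose only to show the expected direction of the effect.  Real environment-specific tables are estimated from structural alignments, not derived by scaling.

```python
# Relative rates quoted in the post for the WAG matrix (no structural context).
WAG_RATES = {("E", "D"): 6.174, ("E", "C"): 0.021}

# HYPOTHETICAL damping factor for buried residues, for illustration only.
CORE_DAMPING = 0.5

RATES_BY_ENVIRONMENT = {
    "exposed": dict(WAG_RATES),
    "core": {pair: rate * CORE_DAMPING for pair, rate in WAG_RATES.items()},
}

def relative_rate(source, target, environment):
    """Look up a relative substitution rate for a residue in a given environment."""
    return RATES_BY_ENVIRONMENT[environment][(source, target)]

print(relative_rate("E", "D", "exposed"))
print(relative_rate("E", "D", "core"))
```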

The article by Worth et al. reviews a large historical body of results concerning protein structure conservation.  The article further describes how we can use environment-specific substitution tables (ESSTs) to explicitly capture information about structural conservation into our Markov model of evolution.  The insights from this paper are fundamental for anyone working with genetic sequence data in an evolutionary context.

Worth CL, Gong S, & Blundell TL (2009). Structural and functional constraints in the evolution of protein families. Nature reviews. Molecular cell biology, 10 (10), 709-20 PMID: 19756040

Evolution 2009: Day 3

June 15, 2009

Once again, I saw too many talks to list them all.

In my opinion, today’s best session was titled “The Evolution of Molecular Function” with speakers Patrick Phillips, Jesse Bloom, and Joe Thornton.  This symposium presented — and then demonstrated — a “functional synthesis” approach to molecular evolution.

Patrick began by talking about the history of genetics: statistical genetics and Mendelian genetics fragmented into many subfields over the past seventy years (pictured below).

Each subfield asks a unique, but separate, question about genes (pictured below).  For example, population genetics explores how fitness is determined by the transmission of genes, whereas molecular genetics explores how genes have effects on phenotype.  Ultimately, an interdisciplinary synthesis provides a holistic understanding of the interplay between genes, gene transmission, gene effects, phenotypes, and fitness.

In the spirit of this “functional synthesis”, Jesse Bloom explained how H1N1 flu virus gained resistance to Oseltamivir (a.k.a. Tamiflu).  Oseltamivir binds the neuraminidase active site, which inhibits H1N1 viral release from an infected cell.  It is suspected that Tamiflu resistance began in 2006; as of 2009, almost all H1N1 strains are Tamiflu resistant.  Resistance is conferred by the H274Y mutation.  By itself, H274Y reduces the fitness of H1N1; it was therefore believed that the H274Y mutation would not spread through the flu population.  So why did resistance to Tamiflu spread?  Jesse speculates that, in general, some neutral mutations can increase protein stability, thus creating a “stability buffer” that lets otherwise fitness-reducing mutations be tolerated.  For the case of H1N1 Tamiflu resistance, his hypothesis appears to be correct: Jesse revealed that the R194G mutation (a neutral mutation) compensates for the H274Y mutation, thus allowing H274Y to spread through the H1N1 population.

Finally, Joe Thornton talked about the evolution of steroid hormone receptors.  Whereas Jesse’s talk highlighted the interactions of just two molecular mutations, Joe showed how historical trajectories of many mutations led to the incredible diversity and specificity of extant proteins which bind steroid hormones.  Many of these mutations demonstrate Dollo’s Law, such that they cannot be undone without deleteriously affecting the protein.  For more information, see (1) Thornton, Nature Reviews Genetics 2004, (2) Bridgham et al. Science 2006, (3) Keay et al. Endocrinology 2006, (4) Ortlund et al. Science 2007, (5) Bridgham et al. PLoS Genetics 2008, and (6) Laskowski et al., Nature Reviews Genetics 2008.

Okay, that’s it for today.

Evolution 2009: Day 1 roundup

June 13, 2009

I’m attending the Evolution 2009 conference in Moscow, Idaho.  Below are some notes from the first day.  There are eight separate lecture tracks, so it’s impossible for me to see everything.  I’m mostly attending lectures focused on phylogenetics, systematics, and molecular evolution. . .

This morning, I planned to hear Peter Turchin talk about “warfare and the evolution of social complexity.”  Unfortunately, I missed his lecture due to an unpublished schedule rearrangement.  Instead, I listened to talks on the subject of speciation.  Asegul Birand presented simulations which demonstrate that a species’ range affects speciation rates.  Marcus Kronforst characterized hotspots of genetic differentiation in Heliconius butterflies; specifically, Marcus showed that wing coloration patterns are adaptive traits that generate reproductive isolation.

Later, I attended a mid-morning session on phylogenetic methods. . .

Jennifer Riplinger (from Jack Sullivan’s lab) discussed the problem of model selection for maximum likelihood bootstrap replicates.  In theory, we should perform model selection for each bootstrap replicate; in practice, most people use the same maximum likelihood model for all replicates.  Jennifer examined the role of replicate model selection on CytB, 18S RNA, and COX1 sequences.  Her results show that model selection for individual bootstrap replicates is unnecessary and does not yield significantly different bootstrap values.  Jennifer makes a good point, but I would like to see her analysis repeated for simulated datasets, where the true phylogenetic partitioning is known.  Furthermore, everyone should be careful about placing too much trust in bootstrap values (see Douady 2003).
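For readers unfamiliar with bootstrap replicates, here is a minimal Python sketch of how a single replicate is built: alignment columns are resampled with replacement until the replicate is as long as the original.  The toy alignment, taxon names, and function name are hypothetical; the model selection and tree inference steps that Jennifer’s talk was actually about are not shown.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the alignment length."""
    n_sites = len(next(iter(alignment.values())))
    # The same column picks are applied to every sequence, so columns stay intact.
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}

# Hypothetical toy alignment of three taxa and six sites.
toy_alignment = {"taxon_a": "ACGTAC", "taxon_b": "ACGTTC", "taxon_c": "ACCTTC"}

rng = random.Random(2009)  # fixed seed so the sketch is repeatable
replicates = [bootstrap_replicate(toy_alignment, rng) for _ in range(100)]
print(replicates[0])
```

In a full analysis, a tree would be inferred from each replicate, and the question in the talk is whether the substitution model should be re-selected for each one.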

Randal Linder presented a software tool “SATe” to simultaneously align sequence data and estimate phylogeny.  Given the short time allowance (only 15 minutes!), I had a difficult time determining how SATe is different from ALIFRITZ or Bali-Phy.  Randal used the “SP” metric to show that SATe produces more accurate alignments than ClustalW, MAFFT, MUSCLE, or Prank.  I am unfamiliar with the “SP” metric, and I wonder if his analysis would yield different results if he used AMA — instead of SP — to measure accuracy.

Alethea Rea presented the “NeighborNet” method to infer phylogenetic networks (instead of trees).  This approach is useful when the true evolutionary history of homologous genes involves recombinant events and/or lateral gene transfer.

Jason Evans (of the Sullivan lab) talked about his approach for averaging models during phylogenetic inference.  Due to the short time constraint, I didn’t entirely understand his cost-based averaging method.  I think integrating uncertainty about the evolutionary model is an appealing phylogenetic problem, but I need to read Jason’s publication before I can say anything critical about his particular method.

Rachel Schwartz talked about error in phylogenetic branch length estimation.  Rachel used simulations to show that Bayesian branch lengths (estimated using MrBayes) generally underestimate the true branch length, while maximum likelihood branch lengths generally overestimate the true length.  The underestimation/overestimation bias is magnified for “deep” internal branches.  In general, for a rooted tree, Bayesian branch lengths make old nodes older and young nodes younger.  On the other hand, maximum likelihood branch lengths make old nodes younger and young nodes older.  Overall, the bias is less pronounced for maximum likelihood estimates, and therefore Bayesian branch lengths should probably be avoided.  Rachel’s talk was robust and comprehensive, and I look forward to reading the forthcoming publication.

Finally, I attended an afternoon symposium in which Michael Alfaro discussed a method (named Medusa) for integrating fossil information into phylogenetic estimates of birth/death rates.  Afterwards, Brian Moore (from John Huelsenbeck’s lab) presented a collection of Bayesian tools for estimating phylogenetic divergence times and diversification rates.

OK, that’s it for now.

Sean Carroll, EvoDevo @ U.O.

May 5, 2009

Sean Carroll visited the University of Oregon over the past couple days.  He’s authored hundreds of research papers and several books on the subject of evolutionary biology.  Here is a brief summary of Sean’s visit. . .

Last night, Sean gave the fifth lecture in our Darwin series.  This was a public talk (for scientists and non-scientists alike) and Sean presented material from his latest book “Remarkable Creatures.” Specifically, he focused on the harrowing stories of Wallace, Darwin, and Bates sailing around South America and the Galapagos islands.  The greatest insight from this lecture was that Wallace and Darwin independently converged on the theory of natural selection.  I think their convergence testifies to the strength of the theory.

Today, Sean gave a technical talk (for the EvoDevo crowd) titled “Endless Flies Most Beautiful: Cis-Regulatory Sequences and the Evolution of Animal Form.”  Sean focused on the central EvoDevo question: How do forms (i.e. morphologies) evolve? He thinks an examination of mosaic pleiotropy is the key to answering this question.  Historically, gene duplication was thought to be the primary mechanism by which new forms evolved.  Sean cited Susumu Ohno’s classic book “Evolution by Gene Duplication.” However, Sean countered Ohno’s thesis by showing evidence that evolution might actually select against gene duplication.  As an example, the evolutionary history of arthropod and tetrapod Hox genes, a gene family known to drive some morphologies, is a story of gene loss, not gene duplication.

Later approaches to the EvoDevo question examined the role of protein sequence evolution, and then eventually King and Wilson examined the role of protein sequence expression.  Essentially, King and Wilson reduced the question “how do forms evolve?” to the micro-question of “how do cis-regulatory elements evolve?”  For the remainder of Sean’s talk, he focused on “cis-regulatory elements as the units of evolution.”

Before the EvoDevo community was examining regulatory elements, inter-species genetic analysis was typically occurring over large taxonomic distances.  This approach proved problematic because transcription factor binding sites are rarely conserved over large phylogenetic distances.  Consequently, the EvoDevo community was forced to find new systems for study.  Sean Carroll’s lab, for example, shifted focus away from studying butterflies and began investigating pigmentation diversity in Drosophila (see Nature, Trends in Genetics, and PNAS).  Unlike butterflies, Drosophila studies offered the ability to explore evolutionary mechanisms at a deeper mechanistic/genetic level.  Among many subsequent results, Sean’s lab discovered that the Tan gene locus is responsible for mosaic pleiotropy in Drosophila santomea’s wing pigmentation.

Based on results from the Tan gene, and several other studies, Sean concluded that regulatory sequence evolution is a more likely mechanism of morphological change than evolution of the coding sequence itself (see PLoS Biology 2005).  Sean gave several examples to support this theory, including a story about the Engrailed gene: an ancient regulatory protein that was recently co-opted to control development of Drosophila wing spots.

Overall, this was an enlightening visit and I feel fortunate to be studying at a university that can engage this caliber of science.  For more information, check out The Carroll Lab.


May 4, 2009

Evolutionary Computation: literature reviews

October 13, 2008

Here are two good overview articles on evolutionary computation.  The first article is more recent and is targeted primarily at computer scientists; the second article is slightly outdated and targeted primarily at ecologists.

“Evolutionary Computation in Bioinformatics: A Review” Sankar K. Pal et al., IEEE Transactions 2006

“Evolutionary Computation: An Overview” Melanie Mitchell and Charles E. Taylor, Ecology and Systematics 1999

Summary: Metagenomics, fruit flies, and lessons learned

December 6, 2007

On November 8th, Nature published two cool articles about metagenomic studies of twelve Drosophila (“fruit flies”) species. In the first paper (click here), The Drosophila 12 Genomes Consortium (D12GG) compared the complete genomic sequences of the twelve Drosophila species, which included the model organism species Drosophila melanogaster. Although the twelve species are related, they exhibit a surprising amount of genetic diversity. For example, the evolutionary distance between D. grimshawi and D. melanogaster is the same distance as between humans and lizards. As a side note, six months earlier (in May 2007), PLoS Genetics published a similar metagenomic comparison of Drosophila (click here for the paper). In the PLoS paper, Hahn et al. present the (somewhat obvious) conclusion: “the apparent stasis in total gene number among species has masked rapid turnover in individual gene gain and loss.”

On November 8, Nature also published this paper (click here), in which Stark et al. (including Hahn) used the data from D12GG’s research to demonstrate a truly novel insight about the connection between conserved metagenomic sequence motifs and functional elements. The result of this paper allows us to infer the presence of functional elements with an accuracy far surpassing previous methods. Specifically, Stark et al. show how to infer the following functional elements, based on a metagenomic sample:

  • Protein-coding regions: have highly constrained codon substitution patterns, and indels have a bias for multiples of three.
  • RNA genes: tolerate substitutions that preserve base pairing.
  • miRNA: can be detected by looking for conserved palindromic stem sequences, with mutable loop subsequences between the two palindrome pieces.
  • Regulatory motifs: have high levels of genome-wide conservation.
  • Post-transcriptional motifs: are typically strand-based conservations.
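
To illustrate just one of these signatures, the indel bias in protein-coding regions, here is a minimal Python sketch.  It is not the method used by Stark et al.; the function name and the two lists of indel lengths are hypothetical.  It simply measures what fraction of observed indels have a length that is a multiple of three, and therefore keep the reading frame intact.

```python
def fraction_frame_preserving(indel_lengths):
    """Fraction of indels whose length is a multiple of three."""
    if not indel_lengths:
        return 0.0
    preserved = sum(1 for length in indel_lengths if length % 3 == 0)
    return preserved / len(indel_lengths)

# Hypothetical indel lengths (in nucleotides) from two aligned regions.
coding_like = [3, 6, 3, 9, 1, 3]
noncoding_like = [1, 2, 3, 4, 5, 7]

print(fraction_frame_preserving(coding_like))
print(fraction_frame_preserving(noncoding_like))
```

A region where most indels preserve the frame looks more like a protein-coding region than one where indel lengths show no such bias.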

