Posts Tagged ‘science’

EvoDevo IGERT Symposium, Day 1

November 14, 2009

[I’m at Indiana University, attending the 2009 IGERT symposium on evolution, development, and genomics.]

Tonight, we heard from Patrick Phillips and PZ Myers.

Patrick gave a broad overview of the past, present, and future of EvoDevo.  The central question of EvoDevo is: how do developmental systems evolve?  Conversely, we can ask: how does development shape the evolutionary process?  Although EvoDevo has witnessed big progress in the last decade, these central questions are unanswered.  Patrick consequently said, “[grad students], your future is secure!”

Patrick claims that EvoDevo lacks a central theory.  In other fields, there is a unit of study: chemistry examines atoms, biochemistry examines molecules, molecular biology examines DNA, population genetics examines DNA sequences, population biology examines individuals, and community ecology examines species.  For EvoDevo, Patrick asserts the unit of study should be (and is) the cell.

Finally, Patrick talked about experimental barriers for EvoDevo.  The most significant barrier is that the genotype-phenotype map is still not completely understood.  A large proportion of research is focused on simply finding genes, let alone understanding how they affect phenotype.  Patrick used Hopi Hoekstra’s work as an example of successful genotype-phenotype mapping.  (Hopi’s lab revealed the genetic mechanisms controlling mouse coloration patterns.)  Although Hopi’s work is seminal, we still have a long way to go towards understanding the genetic mechanisms that control complex phenotypes, such as behavior.

After Patrick’s introduction, PZ Myers gave a talk titled, “Repelled and Fascinated: Coping with the Public Response to Evolution.”  PZ Myers authors a famous (or infamous) blog about evolutionary biology, and has lately become a lightning rod for attacks from the creationist and intelligent design community.

PZ started by showing results from Pew polls, suggesting that about 50% of the U.S. public does not believe in evolution.  Furthermore, about 16% of U.S. high school science teachers don’t believe in evolution [citation: Berkman et al., 2008, PLoS Biology].  Although these numbers are alarming, PZ thinks the public is only nominally creationist and confounded by the loud voices of creationists.

PZ next gave a “pocket guide to creationism” in which he explained the history of the creationist movement.  PZ traces creationism’s roots to Archbishop James Ussher, who calculated the age of the earth using dates from the Bible.  Until the early 1900s, most of the U.S. public was willing to accept the Bible as metaphor.  The *best* slide from PZ’s talk was a phylogeny expressing the history of creationism.  I include it here, but I’m sorry that it’s slightly blurry:

[Note to PZ: if you’d rather I don’t share this photo, let me know]

PZ went on to discuss some significant events in the history of creationism: the Scopes trial in 1925, The Genesis Flood in 1961, Edwards vs. Aguillard in 1987, and Kitzmiller vs. Dover in 2005.  PZ claims that “scientific” creationism comes from Seventh-day Adventism, but it has been intellectually laundered to hide or sever its Seventh-day Adventist roots.  The most radical change in the creationist movement has been towards portraying evolutionary biologists as “evil.”

In response to the increasing fundamentalism of the creationist movement, PZ asserts that we (evolutionary biologists) should be more active with our outreach.  In particular, we should write blogs!

Journal Club: Structural and Functional Constraints in the Evolution of Protein Families

October 14, 2009

The theme of this year’s IGERT EvoDevo symposium is “Current Frontiers of Evolution, Development, and Genomics.”  Every Friday, starting this week and ending December 4th, our IGERT group is hosting a journal club discussion about our own research in the broader context of paradigm-shifting publications in Evo/Devo/Geno.  This week, I’m leading the discussion about my research in computational methods for ancestral sequence reconstruction in the context of a recent review by Catherine Worth, Sungsam Gong, and Tom L. Blundell titled “Structural and Functional Constraints in the Evolution of Protein Families.” If your campus provides access to the journal Nature Reviews, the paper can be found here: http://www.nature.com/nrm/journal/v10/n10/abs/nrm2762.html

Here are my insights into why this paper is fundamentally relevant for anyone working with genetic sequence data in an evolutionary context. . .

Scientific frontiers appear when we integrate analyses from the micro and the macro scale. Examples of this include how biology is informed by chemistry, chemistry is informed by physics, and classical physics is informed by quantum physics.  This trend is true for EvoDevo: we are rapidly arriving at an understanding of evolution from increasingly scientific first principles.  To be specific, we are beginning to understand how mutations in protein sequence and structure — at the biophysical scale — have consequences for the function and phenotype of cells, species, and individuals — at the macro scale [see Dean and Thornton, Nature Reviews 2007].

In order to reveal the evolutionary trajectory of a particular protein structure, we need to examine ancient forms of that protein.  However, the simple acquisition of ancestral molecules can be a major obstacle when we examine evolutionary histories over millions of years because the ancestral forms are typically extinct.  As a computational alternative, we can time travel via statistical inference [see Thornton, Nature Reviews 2004].

I study computational and phylogenetic methods that make it possible for us to probabilistically infer phylogenies and reconstruct ancestral gene sequences.  One of the most important inventions in the history of phylogenetic methods is the use of Markov models to approximate the evolution of gene sequences.  Markov models are used all over the place in information science: to model natural language, radio transmissions, and white noise.  Markov models are used in speech recognition, your email’s spam filter, and global weather prediction.  Google’s core search algorithm is fundamentally just a complex Markov model.

The core idea of the Markov model concerns characters transitioning (i.e. mutating) over time.  Suppose we have some character, like a single nucleotide or an amino acid, and it is currently in state X, where X is one of the letters in our nucleotide or amino acid alphabet.  Over a time interval of length t, X will mutate to state Y with a probability determined by a matrix of relative substitution rates.  This model follows the Markov property: the probability of Y later mutating to state Z over time t2 depends only on Y, and is independent of the prior state X.
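To make the Markov property concrete, here is a minimal sketch in Python.  It assumes the Jukes-Cantor nucleotide model (my choice for illustration, because its transition probabilities have a simple closed form; the matrices discussed in this post are richer).  The function names are hypothetical.  The sketch checks that mutating for time t1 and then for time t2 gives the same probabilities as mutating for time t1 + t2, which is exactly what "independent of the prior state" buys us.

```python
import math

NUCLEOTIDES = "ACGT"

def jc69_transition_matrix(t):
    """Return the 4x4 matrix P(t) under the Jukes-Cantor model (an assumption).

    t is the branch length in expected substitutions per site.
    Entry [i][j] is the probability that state i becomes state j after time t.
    """
    decay = math.exp(-4.0 * t / 3.0)
    same = 0.25 + 0.75 * decay   # probability of ending in the starting state
    diff = 0.25 - 0.25 * decay   # probability of ending in one specific other state
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def matmul(a, b):
    """Multiply two square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# X -> Y over t1 = 0.1, then Y -> Z over t2 = 0.3, summed over every intermediate Y ...
two_steps = matmul(jc69_transition_matrix(0.1), jc69_transition_matrix(0.3))
# ... matches X -> Z over a single interval of length 0.4.
one_step = jc69_transition_matrix(0.4)
```

Each row of the matrix sums to one, because the character must end up in some state.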

If we calculate transition probabilities for all branches in a phylogenetic tree, we can thus calculate the likelihood of that tree and infer the maximum a posteriori ancestral protein sequence.  In this discussion, I will avoid articulating all the mathematical minutiae of how we calculate probabilities for trees and ancestral sequences; you can learn more by reading this excellent book edited by Olivier Gascuel.  Instead, I want to focus on the substitution matrix: it is an approximation of molecular evolution and it makes critical assumptions about evolutionary forces.
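For the curious, the smallest possible version of this calculation looks roughly like the sketch below.  It assumes a tree with just two leaves hanging off one ancestor, a single nucleotide site, and the Jukes-Cantor model with uniform state frequencies.  All of those are simplifying assumptions of mine, and the function names are hypothetical; real software handles full trees, many sites, and amino acid matrices.

```python
import math

STATES = "ACGT"

def jc69_prob(x, y, t):
    """Probability that state x becomes state y over branch length t (Jukes-Cantor)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay

def root_posterior(leaf_a, t_a, leaf_b, t_b):
    """Return (site likelihood, posterior over the ancestral state) for a two-leaf tree."""
    # Joint probability of each candidate ancestral state and the observed leaves.
    # The 0.25 is the assumed uniform prior on the ancestral state.
    joint = {x: 0.25 * jc69_prob(x, leaf_a, t_a) * jc69_prob(x, leaf_b, t_b)
             for x in STATES}
    likelihood = sum(joint.values())              # sum over all ancestral states
    posterior = {x: joint[x] / likelihood for x in STATES}
    return likelihood, posterior

likelihood, posterior = root_posterior("A", 0.1, "A", 0.2)
map_state = max(posterior, key=posterior.get)     # maximum a posteriori ancestor
```

On a larger tree the same sum over hidden ancestral states is done recursively, node by node, rather than in one line.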

In their simplest form (as a 4×4 nucleotide matrix or 20×20 amino acid matrix), substitution matrices assume that all residues with the same state are in a homogeneous biophysical environment, and are thus exposed to the same mutational forces.  For example, the WAG matrix assumes that all glutamic acids (E) can be treated equally, and thus the relative substitution rate for any glutamic acid mutating into aspartic acid (D) is 6.174, while the relative rate of any glutamic acid mutating to cysteine (C) is 0.021.  The assumption of structural homogeneity is often invalid; for example, as is illustrated in this week’s review by Worth et al., residues buried in solvent-inaccessible cores of a protein tend to be more conserved than residues located on the exterior of proteins.  This insight implies that we need a secondary substitution matrix expressing relative mutation rates for residues located in protein cores.  As an example, if E stands for an external glutamic acid and E’ stands for a core glutamic acid, we should expect the relative substitution rate for E-to-D to be larger than the relative rate for E’-to-D’.

The article by Worth et al. reviews a large historical body of results concerning protein structure conservation.  The article further describes how we can use environment-specific substitution tables (ESSTs) to explicitly capture information about structural conservation into our Markov model of evolution.  The insights from this paper are fundamental for anyone working with genetic sequence data in an evolutionary context.
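As a rough sketch of what "environment-specific" means in practice: instead of one lookup table, we keep one table per structural environment and pick the table according to where the residue sits.  In the Python below, the exposed-residue table reuses the two WAG values quoted earlier in this post (treating them as exposed rates is my assumption), and the buried-residue table is NOT real data.  It is simply the exposed table multiplied by a made-up factor so that the lookup has something to return.  Real ESSTs are estimated from structural alignments.

```python
# Relative rates for exposed residues: the two WAG values quoted in this post,
# used here as stand-ins for an exposed-residue table (an assumption).
EXPOSED_RATES = {("E", "D"): 6.174, ("E", "C"): 0.021}

# HYPOTHETICAL: buried residues are more conserved, so scale the rates down.
# The factor 0.5 is invented for illustration only.
HYPOTHETICAL_CORE_SCALING = 0.5
BURIED_RATES = {pair: rate * HYPOTHETICAL_CORE_SCALING
                for pair, rate in EXPOSED_RATES.items()}

TABLES = {"exposed": EXPOSED_RATES, "buried": BURIED_RATES}

def relative_rate(source, target, environment):
    """Look up the relative substitution rate in the table for this environment."""
    return TABLES[environment][(source, target)]

exposed_e_to_d = relative_rate("E", "D", "exposed")
buried_e_to_d = relative_rate("E", "D", "buried")
```

The point of the design is that the Markov machinery stays the same; only the table consulted at each site changes.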

Worth CL, Gong S, & Blundell TL (2009). Structural and functional constraints in the evolution of protein families. Nature Reviews Molecular Cell Biology, 10 (10), 709-720. PMID: 19756040

Evolution 2009: Day 2

June 14, 2009

I saw too many talks today to comprehensively discuss them all.  Here are a few that stand out:

Matt Hahn discussed the correlation (or lack thereof) between protein sequence similarity and protein function similarity.  Although we have increasingly complex models of sequence evolution (using Markov models, for example), we know almost nothing about how protein function evolves.  Matt raised three questions: (1) How fast does protein function evolve? (2) Can we correlate the rate of evolution for protein function to the rate of evolution for protein sequences? (3) Can we find evidence for differential rates of protein function evolution in different types of protein families? Given the short time constraint (15 minutes!), Matt did not conclusively answer any of these questions, but that’s not necessarily a critique of his lecture.  His hypothesis was that the rate of evolution for protein function should be slower in orthologs and faster in paralogs.  To test this hypothesis, Matt gathered protein function annotations from the Gene Ontology Consortium and plotted these data against rates of evolution for protein sequences.  Surprisingly, Matt observed that (1) function appears to evolve faster in orthologs than in paralogs, and (2) there is no relation between the rates of sequence evolution and functional evolution.  Both of these results are difficult to explain.  Obviously, Matt’s results depend on the accuracy of the Gene Ontology annotations, which are unlikely to be entirely accurate.  I think Matt is asking a set of questions that are critically important, but I don’t think accurate answers will be found until we develop a different method for classifying and measuring protein function.

Paul Hohenlohe discussed RAD sequencing with the Illumina Genome Analyzer II to measure genetic variance (as Fst) in stickleback populations.  (RAD sequencing is introduced by Selker et al., Genetics 2007).  Sticklebacks are ancestrally a saltwater fish with bony armor plates.  Sticklebacks colonize freshwater habitats; colonizing populations lose some — or all — of their armor.  Paul used RAD sequencing with Alaskan stickleback populations, and showed that population structures vary between the saltwater and freshwater populations.  Paul’s analysis of stickleback populations provides a compelling example of how RAD sequencing is a high-throughput method for population genomics.

Joe Felsenstein talked about “phylogenetic geometric morphometrics.”  Given homologous extant morphologies with a set of identified (x,y) coordinates, Joe first showed geometric techniques to rotate and translate the extant geometries such that they are “aligned” in a roughly analogous fashion to sequence alignment.  Next, given a phylogeny relating the extant morphologies, Joe discussed a model using Brownian motion to infer ancestral forms, i.e., an ancestral set of Cartesian coordinates.  I’m not a developmental biologist, so I can’t offer much critique of this method.  I’m curious how he plans to deal with missing data, i.e. extant morphologies with (x,y) coordinates that don’t appear in all descendants.
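To give a flavor of the Brownian motion step, here is a minimal sketch in Python.  This is not Joe’s method.  It is the textbook special case of one ancestor with two descendants, where each coordinate is assumed to evolve as an independent Brownian motion and the shapes are assumed to be already aligned.  Under those assumptions, the maximum likelihood ancestor is an average of the descendants weighted by the inverse of their branch lengths.  The function name and the toy landmarks are hypothetical.

```python
def bm_ancestor(coords_a, v_a, coords_b, v_b):
    """Estimate ancestral landmarks from two aligned descendants.

    coords_a, coords_b: lists of (x, y) landmarks in the same order.
    v_a, v_b: branch lengths, i.e. the variance accumulated along each branch.
    """
    w_a, w_b = 1.0 / v_a, 1.0 / v_b   # shorter branches get more weight
    total = w_a + w_b
    return [((w_a * xa + w_b * xb) / total, (w_a * ya + w_b * yb) / total)
            for (xa, ya), (xb, yb) in zip(coords_a, coords_b)]

# Toy triangles that differ only in the height of the third landmark.
species_a = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]
species_b = [(0.0, 0.0), (1.0, 0.0), (0.5, 2.0)]

# Species A sits on a short branch (1.0) and species B on a long one (3.0),
# so the ancestor should look more like species A.
ancestor = bm_ancestor(species_a, 1.0, species_b, 3.0)
```

The missing-data question I raise above shows up immediately in this sketch: `zip` silently assumes both species have every landmark.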

Finally, James Foster talked about “evolutionary computation.”  Specifically, any process which demonstrates replication, variation, and selection will necessarily demonstrate evolution.  James’ point is that evolution can take place on digital artifacts as well as biological artifacts.  He gave several examples of genetic algorithms applied to problems as far-reaching as ML phylogenetic estimation (Zwickl 2006), electronic circuit construction (Koza 1985), and jet engine design (Rechenberg 1966).  I totally agree with James’ point that evolutionary computation is useful for solving a wide range of problems, but I’m afraid his point fell on many deaf ears at this biologically-focused conference.
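Here is a minimal sketch of replication, variation, and selection acting on a digital artifact.  It is a toy genetic algorithm of my own (not one of the systems James cited) that evolves bit strings toward all ones, the so-called OneMax problem.  The parameter values and function name are arbitrary choices for illustration.

```python
import random

def evolve(genome_length=20, population_size=30, generations=40,
           mutation_rate=0.02, seed=1):
    """Run a toy genetic algorithm and return the best fitness per generation."""
    rng = random.Random(seed)
    fitness = sum  # OneMax: fitness is the number of 1 bits in the genome
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    best_per_generation = [max(map(fitness, population))]
    for _ in range(generations):
        # Selection: the fitter half of the population survives as parents.
        parents = sorted(population, key=fitness, reverse=True)[: population_size // 2]
        # Replication with variation: copy a random parent, flipping each bit
        # with a small probability.
        children = [[bit ^ (rng.random() < mutation_rate) for bit in rng.choice(parents)]
                    for _ in range(population_size - len(parents))]
        population = parents + children
        best_per_generation.append(max(map(fitness, population)))
    return best_per_generation

history = evolve()
```

Because the parents survive unchanged into the next generation, the best fitness can never decrease from one generation to the next.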

Okay, that’s it for now.

Evolution 2009: Day 1 roundup

June 13, 2009

I’m attending the Evolution 2009 conference in Moscow, Idaho.  Below are some notes from the first day.  There are eight separate lecture tracks, so it’s impossible for me to see everything.  I’m mostly attending lectures focused on phylogenetics, systematics, and molecular evolution. . .

This morning, I planned to hear Peter Turchin talk about “warfare and the evolution of social complexity.”  Unfortunately, I missed his lecture due to an unpublished schedule rearrangement.  Instead, I listened to talks on the subject of speciation.  Asegul Birand presented simulations which demonstrate that a species’ range affects speciation rates.  Marcus Kronforst characterized hotspots of genetic differentiation in Heliconius butterflies; specifically, Marcus showed that wing coloration patterns are adaptive traits that generate reproductive isolation.

Later, I attended a mid-morning session on phylogenetic methods. . .

Jennifer Ripplinger (from Jack Sullivan’s lab) discussed the problem of model selection for maximum likelihood bootstrap replicates.  In theory, we should perform model selection for each bootstrap replicate; in practice, most people use the same maximum likelihood model for all replicates.  Jennifer examined the role of replicate model selection on CytB, 18S rRNA, and COX1 sequences.  Her results show that model selection for individual bootstrap replicates is unnecessary and does not yield significantly different bootstrap values.  Jennifer makes a good point, but I would like to see her analysis repeated for simulated datasets, where the true phylogenetic partitioning is known.  Furthermore, everyone should be careful about placing too much trust in bootstrap values (see Douady 2003).
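For readers who haven’t met the bootstrap, a replicate is just the original alignment with its columns resampled with replacement.  The Python sketch below shows only that resampling step, on toy data I made up; the function name is hypothetical.  The question in the talk is whether the model selection step (not shown here) needs to be rerun on every replicate, or only once on the original alignment.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: dict mapping taxon name -> aligned sequence (all the same length).
    Returns a new alignment with the same taxa and the same number of columns.
    """
    n_sites = len(next(iter(alignment.values())))
    # Draw n_sites column indices; some columns repeat, others drop out.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}

# Toy data, invented for illustration.
alignment = {"taxon1": "ACGTAC", "taxon2": "ACGTTC", "taxon3": "ACCTAC"}
rng = random.Random(42)
replicates = [bootstrap_replicate(alignment, rng) for _ in range(100)]
```

A tree is then estimated from each replicate, and the bootstrap value of a clade is the fraction of replicate trees that contain it.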

Randal Linder presented a software tool “SATe” to simultaneously align sequence data and estimate phylogeny.  Given the short time allowance (only 15 minutes!), I had a difficult time determining how SATe is different from ALIFRITZ or BAli-Phy.  Randal used the “SP” metric to show that SATe produces more accurate alignments than ClustalW, MAFFT, MUSCLE, or Prank.  I am unfamiliar with the “SP” metric, and I wonder if his analysis would yield different results if he used AMA (instead of SP) to measure accuracy.

Alethea Rea presented the “NeighborNet” method to infer phylogenetic networks (instead of trees).  This approach is useful when the true evolutionary history of homologous genes involves recombinant events and/or lateral gene transfer.

Jason Evans (of the Sullivan Lab) talked about his approach for averaging models during phylogenetic inference.  Due to the short time constraint, I didn’t entirely understand his cost-based averaging method.  I think integrating uncertainty about the evolutionary model is an appealing phylogenetic problem, but I need to read Jason’s publication before I can say anything critical about his particular method.

Rachel Schwartz talked about error in phylogenetic branch length estimation.  Rachel used simulations to show that Bayesian branch lengths (estimated using MrBayes) generally underestimate the true branch length, while maximum likelihood branch lengths generally overestimate the true length.  The underestimation/overestimation bias is magnified for “deep” internal branches.  In general, for a rooted tree, Bayesian branch lengths make old nodes older and young nodes younger.  On the other hand, maximum likelihood branch lengths make old nodes younger and young nodes older.  Overall, the bias is less pronounced for maximum likelihood estimates, and therefore Bayesian branch lengths should probably be avoided.  Rachel’s talk was robust and comprehensive, and I look forward to reading the forthcoming publication.

Finally, I attended an afternoon symposium in which Michael Alfaro discussed a method (named Medusa) for integrating fossil information into phylogenetic estimates of birth/death rates.  Afterwards, Brian Moore (from John Huelsenbeck’s lab) presented a collection of Bayesian tools for estimating phylogenetic divergence times and diversification rates.

OK, that’s it for now.

Sean Carroll, EvoDevo @ U.O.

May 5, 2009

Sean Carroll visited the University of Oregon over the past couple days.  He’s authored hundreds of research papers and several books on the subject of evolutionary biology.  Here is a brief summary of Sean’s visit. . .

Last night, Sean gave the fifth lecture in our Darwin series.  This was a public talk (for scientists and non-scientists alike) and Sean presented material from his latest book “Remarkable Creatures.” Specifically, he focused on the harrowing stories of Wallace, Darwin, and Bates sailing around South America and the Galapagos islands.  The greatest insight from this lecture was that Wallace and Darwin independently converged on the theory of natural selection.  I think their convergence testifies to the strength of the theory.

Today, Sean gave a technical talk (for the EvoDevo crowd) titled “Endless Flies Most Beautiful: Cis-Regulatory Sequences and the Evolution of Animal Form.”  Sean focused on the central EvoDevo question: How do forms (i.e. morphologies) evolve? He thinks an examination of mosaic pleiotropy is the key to answering this question.  Historically, gene duplication was thought to be the primary mechanism by which new forms evolved.  Sean cited Susumu Ohno’s classic book “Evolution by Gene Duplication.” However, Sean countered Ohno’s thesis by showing evidence that evolution might actually select against gene duplication.  As an example, the evolutionary history of arthropod and tetrapod Hox genes (a gene family known to drive some morphologies) is a story of gene loss, not gene duplication.

Later approaches to the EvoDevo question examined the role of protein sequence evolution, and then eventually King and Wilson examined the role of protein sequence expression.  Essentially, King and Wilson reduced the question “how do forms evolve?” to the micro-question of “how do cis-regulatory elements evolve?”  For the remainder of Sean’s talk, he focused on “cis-regulatory elements as the units of evolution.”

Before the EvoDevo community was examining regulatory elements, inter-species genetic analysis was typically occurring over large taxonomic distances.  This approach proved problematic because transcription factor binding sites are rarely conserved over large phylogenetic distances.  Consequently, the EvoDevo community was forced to find new systems for study.  Sean Carroll’s lab, for example, shifted focus away from studying butterflies and began investigating pigmentation diversity in Drosophila (see Nature, Trends in Genetics, and PNAS).  Unlike butterflies, Drosophila offered the ability to explore evolutionary mechanisms at a deeper mechanistic/genetic level.  Among many subsequent results, Sean’s lab discovered that the Tan gene locus is responsible for mosaic pleiotropy in Drosophila santomea’s wing pigmentation.

Based on results from the Tan gene and several other studies, Sean concluded that regulatory sequence evolution is a more likely mechanism of morphological change than coding sequence evolution (see PLoS Biology 2005).  Sean gave several examples to support this theory, including a story about the Engrailed gene: an ancient regulatory protein that was recently co-opted to control development of Drosophila wing spots.

Overall, this was an enlightening visit and I feel fortunate to be studying at a university that can engage this caliber of science.  For more information, check out The Carroll Lab.

Pacific Symposium on Biocomputing 2009

January 13, 2009

The 2009 Pacific Symposium on Biocomputing ended last week; apparently, it was awesome. I couldn’t attend this year. . . but next year?

Although the party is over, the media remains. Here are some compelling links:

The Official PSB 2009 webpage

The official PSB 2009 conference proceedings

The PSB FriendFeed room

Also, it looks like there was a cool workshop on open science.

Summary: Metagenomics, fruit flies, and lessons learned

December 6, 2007

On November 8th, Nature published two cool articles about metagenomic studies of twelve Drosophila (“fruit flies”) species. In the first paper (click here), The Drosophila 12 Genomes Consortium (D12GC) compared the complete genomic sequences of the twelve Drosophila species, which included the model organism species Drosophila melanogaster. Although the twelve species are related, they exhibit a surprising amount of genetic diversity. For example, the evolutionary distance between D. grimshawi and D. melanogaster is the same as the distance between humans and lizards. As a side note, six months earlier (in May 2007), PLoS Genetics published a similar metagenomic comparison of Drosophila (click here for the paper). In the PLoS paper, Hahn et al. present the (somewhat obvious) conclusion: “the apparent stasis in total gene number among species has masked rapid turnover in individual gene gain and loss.”

On November 8, Nature also published this paper (click here), in which Stark et al. (including Hahn) used the data from D12GC’s research to demonstrate a truly novel insight about the connection between conserved metagenomic sequence motifs and functional elements. The result of this paper allows us to infer the presence of functional elements with an accuracy far surpassing previous methods. Specifically, Stark et al. show how to infer the following functional elements, based on a metagenomic sample:

  • Protein-coding regions: have highly constrained codon substitution patterns, and indels have a bias for multiples of three.
  • RNA genes: tolerate substitutions that preserve base pairing.
  • miRNA: can be detected by looking for conserved palindromic stem sequences, with mutable loop sub-sequences between the two palindrome pieces.
  • Regulatory motifs: have high levels of genome-wide conservation.
  • Post-transcriptional motifs: are typically strand-based conservations.
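
As a small illustration of the first signature in the list above (indels biased toward multiples of three), here is a Python sketch.  It is not the method from Stark et al., which works on multi-species alignments with proper codon models.  It simply measures, for one hypothetical pairwise alignment that I made up, what fraction of gap runs would preserve the reading frame.  The function names are hypothetical too.

```python
import re

def gap_lengths(aligned_a, aligned_b):
    """Return the length of every run of '-' in either aligned sequence."""
    lengths = []
    for seq in (aligned_a, aligned_b):
        lengths.extend(len(match.group()) for match in re.finditer(r"-+", seq))
    return lengths

def frame_preserving_fraction(aligned_a, aligned_b):
    """Fraction of indels whose length is a multiple of three (None if no indels)."""
    lengths = gap_lengths(aligned_a, aligned_b)
    if not lengths:
        return None
    return sum(1 for n in lengths if n % 3 == 0) / len(lengths)

# Toy alignment, invented for illustration: two 3-base gaps and one 1-base gap.
seq_a = "ATGGCC---AAATTT-GGG"
seq_b = "ATG---GCCAAATTTCGGG"

lengths = gap_lengths(seq_a, seq_b)
fraction = frame_preserving_fraction(seq_a, seq_b)
```

A region where this fraction is high across many species is a better candidate for protein-coding sequence than one where gap lengths look random.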