Archive for the ‘academia’ Category

Evolutionary Computation: literature reviews

October 13, 2008

Here are two good overview articles on evolutionary computation.  The first article is more recent and is targeted primarily at computer scientists; the second article is slightly outdated and targeted primarily at ecologists.

“Evolutionary Computation in Bioinformatics: A Review” Sankar K. Pal et al., IEEE Transactions 2006

“Evolutionary Computation: An Overview” Melanie Mitchell and Charles E. Taylor, Ecology and Systematics 1999


Phylogenetic Dilemma

July 14, 2008

(This post is for my own notes.  It will probably make sense to about 0.00000001% of people subscribed to this blog.)

Consider the two unrooted phylogenies shown below.  Both trees contain the same taxa.  Suppose these topologies were discovered during an MCMC run. The “ingroup” taxa are blue and named with “i-.”  The outgroup taxa are red and named with “o-.”  On the ML tree, it’s pretty clear which nodes are the last common ancestors of the ingroup and the outgroup.  However, on the alternate tree, the rooting is ambiguous.

[Figures: Phylogenetic Dilemma (the two unrooted trees)]

Simply put, on the alternate tree the ingroup and outgroup are not monophyletic. This dilemma is problematic for ancestral sequence reconstruction (ASR) methods that attempt to incorporate phylogenetic uncertainty.  In such methods, we want to calculate the Bayesian average of the ancestral sequence from both trees.  On the alternate tree, which ancestor do we choose?

Here are three potential solutions:

  1. Discard trees with non-monophyletic ingroups/outgroups.
  2. Randomly select one of the putative ancestors, and forget about the other one.
  3. Use the average of the putative ancestral sequences.  In our example, the alternate tree contains two possible roots.  Therefore, we average the ancestral sequences from both of these nodes, and then use this averaged sequence in our overall Bayesian average with other trees.
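
Solution 3 can be sketched in a few lines of Python. This is only a minimal sketch under my own assumptions: each reconstructed ancestor is represented as a list of per-site posterior distributions, the putative roots on one tree are treated as equally likely, and the tree weights stand in for posterior tree probabilities. The names `average_ancestors` and `tree_ancestor` and all the numbers are hypothetical.

```python
from typing import Dict, List

# Hypothetical representation: an ancestor is a list of sites, and each site
# maps a character state to its posterior probability.
Site = Dict[str, float]
Ancestor = List[Site]


def average_ancestors(ancestors: List[Ancestor], weights: List[float]) -> Ancestor:
    """Weighted average of per-site state distributions (weights are normalized)."""
    if not ancestors or len(ancestors) != len(weights):
        raise ValueError("need one weight per ancestor")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_sites = len(ancestors[0])
    if any(len(a) != n_sites for a in ancestors):
        raise ValueError("all ancestors must have the same number of sites")
    averaged: Ancestor = []
    for i in range(n_sites):
        site: Site = {}
        for anc, w in zip(ancestors, weights):
            for state, p in anc[i].items():
                site[state] = site.get(state, 0.0) + p * w / total
        averaged.append(site)
    return averaged


def tree_ancestor(candidate_roots: List[Ancestor]) -> Ancestor:
    """Solution 3: average the putative roots of one tree with equal weight."""
    return average_ancestors(candidate_roots, [1.0] * len(candidate_roots))


# Toy one-site example with made-up numbers (not from a real analysis).
ml_tree = [{"A": 0.9, "G": 0.1}]
alt_root_1 = [{"A": 0.6, "G": 0.4}]
alt_root_2 = [{"A": 0.2, "G": 0.8}]
alt_tree = tree_ancestor([alt_root_1, alt_root_2])

# The weights 0.75 and 0.25 stand in for the posterior probabilities of the trees.
overall = average_ancestors([ml_tree, alt_tree], [0.75, 0.25])
```

Whether equal weights for the putative roots are appropriate is an open question; the same function accepts unequal weights if there is a reason to prefer one root.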

Fixing Mr. Bayes, MPI, and SSH keys

July 2, 2008

Here are obscure notes about solving a problem with Mr. Bayes, MPI, and SSH:

PROBLEM: Mr. Bayes (or some other MPI application) fails. When we execute this command:

mpirun -v -machinefile .bhosts -np 8 mb < script.nex

. . . we get the following output:

running /common/bin/mb on 8 freebsd_ppc ch_p4 processors
Created /Users/victor/PI26710
Parallel version of
p0_26516: p4_error: Child process exited while making connection to remote process on node003.cluster.private: 0
p0_26516: (15.092200) net_send: could not write to fd=5, errno = 32

DIAGNOSIS: Your SSH keys are not set up correctly to allow MPI to communicate with other nodes.

SOLUTION: Follow these steps. . .
  1. cd .ssh
  2. ssh-keygen -t dsa -f id_dsa
  3. cat id_dsa.pub >> authorized_keys
  4. chmod 640 authorized_keys
  5. Open authorized_keys with your favorite text editor. The first line should contain a key for you@your.awesome.cluster.
  6. Copy the first line. Paste this line once for each node in the cluster. Change the hostname to match the name of the node. For example, the first few lines of my authorized_keys file look like this (where “. . .” are pieces I’ve abridged for security reasons):

ssh-dss AAAAB3NzaC1kc3MAAACBAO6K5GKxrd2UO. . .
. . .
ssh-dss AAAAB3NzaC1kc3MAAACBAO6K5GKxrd2UO. . .
. . .

X8= victor@node002.cluster.private
ssh-dss AAAAB3NzaC1kc3MAAACBAO6K5GKxrd2UO. . .
. . .

X8= victor@node003.cluster.private

. . . and now your MPI application should work.
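
Before rerunning mpirun, it can help to confirm that every node accepts a key-based login. Below is a minimal Python sketch, not part of the original fix. It assumes the machinefile lists one host per line, optionally as “host:ncpus”, and the function names are hypothetical. The ssh options BatchMode and ConnectTimeout are standard OpenSSH options; with BatchMode=yes, ssh fails instead of prompting for a password.

```python
import subprocess
from typing import List


def read_hosts(machinefile_text: str) -> List[str]:
    """Return hostnames from machinefile text, skipping blanks and # comments."""
    hosts = []
    for line in machinefile_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        # Assumed format: "host" or "host:ncpus"; keep only the hostname.
        name = line.split(":", 1)[0].strip()
        if name:
            hosts.append(name)
    return hosts


def ssh_check_command(host: str) -> List[str]:
    """Build an ssh command that fails rather than prompting for a password."""
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "true"]


def check_hosts(hosts: List[str]) -> List[str]:
    """Return the hosts that could not be reached with a key-based login."""
    failed = []
    for host in hosts:
        result = subprocess.run(ssh_check_command(host), capture_output=True)
        if result.returncode != 0:
            failed.append(host)
    return failed
```

To use it, read the machinefile and pass the hosts along, for example `check_hosts(read_hosts(open(".bhosts").read()))`. Any host in the returned list still needs attention. Note that a host whose host key has never been accepted will also fail in batch mode.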

If you’re fixing this problem for someone else (assuming you have root privileges), do the following additional steps:

  1. All the keys you generate will be for root@my.awesome.cluster. In authorized_keys, change root@my.awesome.cluster to someone.else@my.awesome.cluster, where someone.else is the appropriate username.
  2. All the keyfiles you generate will be owned by root, which is not what we want. Run “chown USERNAME authorized_keys id_dsa*” to fix the ownership.

MCMC “burn-in” calculator

May 22, 2008

Here is a Ruby script for calculating the “burn-in” for a Markov Chain Monte Carlo (MCMC) run of the Mr. Bayes software package. In some circles, the end of the “burn-in” is referred to as reaching stationarity.

My script performs the following steps:

  1. Parses the *.p file(s) from a Mr. Bayes MCMC run.
  2. Calculates the average log likelihood for the final 15% of the samples.
  3. Starting at the top of the *.p file(s), finds the first sample whose log likelihood value is equal to the value calculated in step 2. This sample is where the burn-in should be drawn. In other words, this sample is where the MCMC run reached stationarity.
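
The three steps above can be sketched in Python. This is not the Ruby script itself, only a minimal sketch under stated assumptions: the *.p file is assumed to start with an optional “[ID: ...]” line and a header row, followed by whitespace-separated rows whose first two columns are the generation and the log likelihood. Because an exact floating-point match is unlikely, the sketch looks for the first sample that reaches or exceeds the average, which is my own substitution. The function names are hypothetical.

```python
from typing import List, Tuple


def parse_p_file(text: str) -> List[Tuple[int, float]]:
    """Return (generation, log likelihood) pairs from the text of a .p file."""
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            samples.append((int(fields[0]), float(fields[1])))
        except ValueError:
            continue  # skips the "[ID: ...]" line and the header row
    return samples


def burn_in_generation(samples: List[Tuple[int, float]],
                       tail_fraction: float = 0.15) -> int:
    """Return the generation of the first sample that reaches the tail average."""
    if not samples:
        raise ValueError("no samples")
    # Step 2: average log likelihood over the final fraction of the samples.
    n_tail = max(1, round(len(samples) * tail_fraction))
    tail = samples[-n_tail:]
    target = sum(lnl for _, lnl in tail) / n_tail
    # Step 3: scan from the top for the first sample at or above that average.
    for generation, lnl in samples:
        if lnl >= target:
            return generation
    # Fallback for floating-point rounding: use the start of the tail.
    return tail[0][0]
```

For multiple *.p files (one per run), one option is to call `burn_in_generation` on each file and take the largest result, though that is a design choice rather than something the steps above specify.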

You can download the file below; instructions are at the top of the file.

DOWNLOAD HERE: burnin-calc.rb

A counterexample of elision’s efficacy over culling.

April 2, 2008

In response to “Elision: A Method for Accommodating Multiple Molecular Sequence Alignments with Alignment-Ambiguous Sites” Wheeler et al. 1995:

In most cases, elision is useful for resolving alignment-ambiguous sites in a multiple sequence alignment (MSA).  Although Wheeler et al. show that elision is better at resolving MSA ambiguities than culling, here is one counterexample in which elision and culling perform equally:

Consider two putative MSAs (as shown in the figure below), written with a three-character alphabet {gamma, delta, epsilon}. In MSA #1, taxa A and B are more homologous to each other than either is to taxon C. In MSA #2, taxa B and C are more homologous to each other than either is to taxon A. MSA #1 and MSA #2 therefore produce symmetrically opposite phylogenies. When we cull over these MSAs, we produce a star tree (because we cull out both columns 1 and 2). When we elide over these MSAs, we also produce a star tree. Consequently, in this example elision and culling both produce equal support for B = “gamma indel” and B = “indel gamma”.
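
To make the argument concrete, here is a small Python sketch. The toy alignments are hypothetical stand-ins and are not the alignments from the figure: “g”, “d”, and “e” stand for gamma, delta, and epsilon, and “-” is an indel. Elision is modeled as concatenating the candidate alignments, and culling as dropping every column on which the candidates disagree. Both are simplifications that assume the candidate alignments have equal length, and gaps are counted as an ordinary state in the distance.

```python
from typing import Dict, List

Alignment = Dict[str, str]  # taxon name -> aligned sequence


def elide(alignments: List[Alignment]) -> Alignment:
    """Elision: concatenate the candidate alignments end to end, per taxon."""
    taxa = alignments[0].keys()
    return {t: "".join(a[t] for a in alignments) for t in taxa}


def cull(alignments: List[Alignment]) -> Alignment:
    """Culling: keep only the columns on which every candidate alignment agrees."""
    first = alignments[0]
    length = len(next(iter(first.values())))
    keep = [
        i for i in range(length)
        if all(a[t][i] == first[t][i] for a in alignments for t in first)
    ]
    return {t: "".join(first[t][i] for i in keep) for t in first}


def hamming(x: str, y: str) -> int:
    """Number of aligned positions at which two sequences differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


# Hypothetical toy data: only the placement of B's gamma differs.
msa1 = {"A": "g-de", "B": "g-de", "C": "-gde"}  # B = "gamma indel"
msa2 = {"A": "g-de", "B": "-gde", "C": "-gde"}  # B = "indel gamma"

culled = cull([msa1, msa2])   # columns 1 and 2 are removed
elided = elide([msa1, msa2])  # each taxon gets both versions, concatenated

dist_ab = hamming(elided["A"], elided["B"])
dist_bc = hamming(elided["B"], elided["C"])
```

In this toy case the culled alignment is identical for all three taxa, and in the elided alignment B is equally distant from A and from C, so neither method favors one placement of B over the other.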

Happy Holidays Everyone: A LaTeX-thesis pack for the University of Oregon

December 12, 2007

A screenshot from Gmail, in which the Graduate School editor approves my thesis.
(view screenshot here)

If you’re writing a thesis or dissertation for the University of Oregon Graduate School (UOGS), here is a set of LaTeX files which will help you produce a document with approved formatting.

Download Here: (6.5 MB)

These files will help you format most of your document, but some hand-crafting might be required if your thesis or dissertation contains non-standard “stuff.” The instructions are in the file README.txt.

Thanks to Peter Boothe for being a LaTeX-ninja, sometimes.

Summary: Metagenomics, fruit flies, and lessons learned

December 6, 2007

On November 8th, Nature published two cool articles about metagenomic studies of twelve Drosophila (“fruit fly”) species. In the first paper (click here), The Drosophila 12 Genomes Consortium (D12GG) compared the complete genomic sequences of the twelve Drosophila species, which included the model organism Drosophila melanogaster. Although the twelve species are related, they exhibit a surprising amount of genetic diversity. For example, the evolutionary distance between D. grimshawi and D. melanogaster is about the same as the distance between humans and lizards. As a side note, six months earlier (in May 2007), PLoS Genetics published a similar metagenomic comparison of Drosophila (click here for the paper). In the PLoS paper, Hahn et al. present the (somewhat obvious) conclusion: “the apparent stasis in total gene number among species has masked rapid turnover in individual gene gain and loss.”

On November 8, Nature also published this paper (click here), in which Stark et al. (including Hahn) used the data from D12GG’s research to demonstrate a truly novel insight about the connection between conserved metagenomic sequence motifs and functional elements. The results of this paper allow us to infer the presence of functional elements with an accuracy far surpassing previous methods. Specifically, Stark et al. show how to infer the following functional elements, based on a metagenomic sample:

  • Protein-coding regions: have highly constrained codon substitutions, and their indels are biased toward lengths that are multiples of three.
  • RNA genes: tolerate substitutions that preserve base pairing.
  • miRNA: can be detected by looking for conserved palindromic stem sequences, with mutable loop sub-sequences between the two palindrome pieces.
  • Regulatory motifs: have high levels of genome-wide conservation.
  • Post-transcriptional motifs: typically show strand-specific conservation.
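
As a toy illustration of the first signature in the list (indel lengths biased toward multiples of three), here is a minimal Python sketch. It is not the method of Stark et al.; it only counts gap runs in already-aligned sequences, and the function names are hypothetical.

```python
import re
from typing import List


def gap_lengths(aligned_seq: str) -> List[int]:
    """Lengths of each maximal run of '-' in one aligned sequence."""
    return [len(run) for run in re.findall(r"-+", aligned_seq)]


def frame_preserving_fraction(aligned_seqs: List[str]) -> float:
    """Fraction of gap runs whose length is a multiple of three."""
    lengths = [n for seq in aligned_seqs for n in gap_lengths(seq)]
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n % 3 == 0) / len(lengths)
```

A region whose gaps mostly preserve the reading frame would score close to 1.0 here, while a region with gaps of arbitrary length would be expected to score much lower. A real classifier would need to combine this with the codon substitution signal.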

SC07 Day 5

November 15, 2007


SC07 Day 4

November 13, 2007

Neil Gershenfeld, the head of MIT’s Center for Bits and Atoms, gave today’s keynote address. To summarize Gershenfeld’s lecture: the killer app of digital fabrication is personal fabrication. Gershenfeld highlighted the MIT FabLabs, and gave examples of boundary-breaking personal computation: wallpaper computers, analog computers, and ad-hoc clusters. This is one of the most compelling keynote lectures I’ve seen. If you’re interested in Neil Gershenfeld, click here to watch his 2007 TED talk.

Alexandros P. Stamatakis presented a paper titled “Large-scale Maximum Likelihood-based Phylogenetic Analysis on the IBM BlueGene/L”. Stamatakis and his team created a parallel implementation of RAxML on the BlueGene/L. Although I’ve used several software packages for phylogenetic tree construction, I was unaware of RAxML. According to Stamatakis’ publications, RAxML is comparable in quality to, and computationally faster than, my current software of choice: Mr. Bayes and PHYML. After hearing Stamatakis’ presentation, I’m interested in using RAxML in one of my current projects, which requires the construction of thousands of phylogenies.

In a related paper session, I learned about the BlueBrain Project, an attempt to computationally simulate every neuron in a mammalian brain. The BlueBrain Project is very ambitious, given the computational complexity of mammalian neurology. (See the image below).

Later. . . . SGI hosted a party at the National Automobile Museum, and SiCortex hosted a party at the National Bowling Stadium. Basically, “cars and bowling” sums up Reno.

SC07 Day 3

November 12, 2007

At the Workshop on Grid Computing Portals and Science Gateways, Marlon Pierce talked about “Web 2.0 for e-Science.” Marlon’s talk was a shotgun blast of compelling information. He talked about micro-programming versus macro-programming, mashups (read more at The Programmable Web), web services, and how computational science can harness Web 2.0. The big point was that scientific web applications (for instance, GenBank) are the perfect building blocks for scientific mashups. Marlon asserts that most scientific workflows can be implemented with a web mashup, composed of smaller web gadgets and web services. Other highlights from Marlon included The Gartner 2006 Hype Cycle. (See the image below).

Later in the day, the exhibition gala was crowded fun. . .