Archive for the ‘bioinformatics’ Category

Protein alignments, ART, and buggy Nexus Files

July 12, 2008

Here are some notes about a strange problem with ART and Nexus files

If you attempt to run ART, and see an error message like this:

ART is veriyfing that your tree file corresponds with your sequence file. .

-> ART says: some of the taxa in your treefile do not exist in your sequence file.

Missing taxa:
SpADH2 KlADH2 SkADH4 ScADH3 KmADH2 KLADH4 KwADH3 SkADH3 SpADH5 SbADH2 SbADH5 SpADH1 SkADH2 PsADH2 KlADH3 KwADH4 SbADH3 PsADH1 KmADH1 ScADH2 SpADH3 ScADH5 KlADH1 SkADH1 KwADH1 SbADh1 ScADH1

Error!
ART was unable to import your data.  See previous errors.

The problem is that ALL of your taxa are not found in your sequence file.  Obviously, you might feel like you’re going crazy, because your alignment looks fine.

SOLUTION: In your Nexus alignment, find a line which looks like this:

FORMAT DATATYPE=PROTEIN  SYMBOLS = ” 1 2 3 4″  MISSING=? GAP=- ;

Remove the “SYMBOLS” parameter.  Apparently, this parameter trips-up the BioNexus parser.

FORMAT DATATYPE=PROTEIN MISSING=? GAP=- ;

. . . and hopefully that should solve the problem!

Advertisements

MCMC “burn-in” calculator

May 22, 2008

Here is a Ruby script for calculating the “burn-in” for a Markov Chain Monte Carlo run, using the Mr. Bayes software package. In some circles, “burn-in” is referred to as stationarity.

My script performs the following steps:

  1. parses the *.p file(s) from a Mr. Bayes mcmc run.
  2. calculates the average log likelihood for the final 15% of the samples
  3. starting at the top of the *.p file(s), finds the first sample whose log likelihood value is equal to the value calculated in #2. This sample is where the burnin should be drawn. In other words, this sample is where the MCMC run reached stationarity.

You can download the file below; instructions are in the top of the file.

DOWNLOAD HERE: burnin-calc.rb

BioPERL Pitfall

January 3, 2008

Hey, if you’re casually reading my blog you can probably ignore this post. I want to talk about a rare pitfall in BioPERL, and I want to make it very accessible to Google search. So here goes:

If you’re using the Bio::TreeIO package to parse Newick-formatted phylogenetic trees, you might encounter an error which looks like this:

Modification of non-creatable array value attempted, subscript -2 at /Library/Perl/5.8.6/Bio/TreeIO/TreeEventBuilder.pm line 256, <GEN0> chunk 2.

For me, this error appeared when I executed the following code block:

while( my $tree = $treeio->next_tree ) {
      # Do something with the $tree object here }

Unfortunately, three hours of Google searching yielded no clues about a solution. Instead, I dived into the BioPERL source code, and here is what I learned. . .

The error message indicates that PERL is trying to write to an array index which has not been initialized. But here’s the trick: the cause of the problem isn’t your PERL code. The problem actually comes from your Newick tree file.

As you probably know, Newick tree files consist of long parenthetical strings. A short example looks like this:

((A:3, B:1):4, C:5):0;

If your parenthetical tree is missing a parenthesis, then you’ll recieve the error message I mentioned above. For example, a “broken” tree looks like this:

(A:3, B:1):4, C:5):0; # notice the missing first parenthesis.

You might wonder: Why are my trees missing a parenthesis?

Answer: When you re-root or un-root a parenthetical tree, it is possible to lose a parenthesis but still have a tree which can recognized by software like FigTree and TreeGraph. Unfortunately, BioPerl is not so forgiving.

So, long story short:

If you’re using Bio::TreeIO and you get an error like this. . .

Modification of non-creatable array value attempted, subscript -2 at /Library/Perl/5.8.6/Bio/TreeIO/TreeEventBuilder.pm line 256, <GEN0> chunk 2.

. . . check your parentheses!

“DNA in a tigh squeeze”

October 16, 2007

Today Rob Phillips (http://www.rpgroup.caltech.edu/) gave a talk titled, “DNA in a tight squeeze: the other life of a macromolecular assembly.”

Phillips et al. study the atomic-level physics of DNA.  Their recent work focuses on DNA loops which are formed when transcription factors bind to regulatory sites.  Specifically, they modified the lac operon such that they can insert sequences of arbitrary length and content between the Oid and O1 sites.   A surprising result is that DNA is happiest to make loops of 75.5 base pair lengths, whereas the persistence length for DNA is 115 base pairs.  I think this result has ramifications for our understanding of the fitness of regulatory regions. I would like to see an experiment which explores evolutionary fitness with regard to loop length between cis-regulatory sites.

Phillips also showed results from his study correlating the osmotic pressure inside a viral capsid to the speed of genetic ejection.  His results show that viruses eject genetic material (into their host cell) very quickly at first.  However, as the inter-viral osmotic pressure declines, the rate of ejection also drops.  Eventually, the pressure reaches a point where no genetic material is ejected at all.  I have several questions about the content of the genetic material which may or may not be ejected.  Does it transcribe into a functional protein?  Or, does this “tail” material contain garbage which can safely be left inside the virus?

I find this research exciting because it intersects physics, chemistry, biology, statistics, and computer science.

(Finally: check-out the VIPER project for atomic-level exploration of viral structures) 

Spectral Network Analysis

August 29, 2007

This video is exciting:

Pavzner and his team address the challenge of determining a protein’s sequence, given mass spectrometry data. This problem is challenging because every protein undergoes translational modifications, and therefore mass-specs actually measure proteins which deviate (slightly) from their non-translated nucleotide sequences.

An unsolved challenge is to correctly infer a protein’s sequence, given mass-spec data. However, we CAN compare mass-specs against a database of existing (and known) mass-specs in order to guess the sequences. This process is computationally expensive because it requires a linear search of the database for every queried mass-spec search key.

Pavzner present a more efficient algorithm for “guessing” a protein’s sequence, given its mass-spectrum. His technique involves constructing a network of spectral data, and then using that network as the basis for a search. This is remarkably faster than traditional database searches. Pavzner, et al, apply this technique to whole-genome spectrometric data, and yield promising results.

My obscure notes:

17:00 – Why is the signal-to-noise ratio reduced “six-fold” ?

18:01 – Look-up reference for “anti-symmetric pass approach” to solving an alignment between two sequences of unequal length. Can we use this storage technique for other information domains: phylogenetic trees? Electroencephalographic data?

28:07 – The use of “snake venom” makes any science project sound cool.