Here are some notes on an obscure problem involving MrBayes, MPI, and SSH:
PROBLEM: MrBayes (or some other MPI application) fails. When we run this command:
mpirun -v -machinefile .bhosts -np 8 mb < script.nex
. . . we get the following output:
running /common/bin/mb on 8 freebsd_ppc ch_p4 processors
Created /Users/victor/PI26710
Password:
Parallel version of
p0_26516: p4_error: Child process exited while making connection to remote process on node003.cluster.private: 0
p0_26516: (15.092200) net_send: could not write to fd=5, errno = 32
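The Password: prompt in that output means ssh is waiting for a password that nobody can type. Before changing anything, you can check which nodes still ask for one. The sketch below is a hypothetical helper, check_passwordless_ssh, not part of MrBayes or MPICH. It assumes the machinefile lists one host per line, optionally written as host:N, and it relies on ssh's BatchMode=yes option, which makes ssh fail instead of prompting.

```shell
# Hypothetical helper: report which hosts in an MPI machinefile still need a
# password for ssh. Assumes one host per line, optionally written as host:N,
# with '#' starting a comment.
check_passwordless_ssh() {
  machinefile=$1
  status=0
  # Drop comments and ":N" processor counts, skip blank lines, de-duplicate.
  hosts=$(sed -e 's/#.*//' -e 's/:.*//' "$machinefile" | awk 'NF { print $1 }' | sort -u)
  for host in $hosts; do
    # BatchMode=yes: fail instead of prompting for a password or passphrase.
    # -n: don't let ssh read from this script's stdin.
    if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
      echo "ok:     $host"
    else
      echo "FAILED: $host"
      status=1
    fi
  done
  return "$status"
}
```

Run it as “check_passwordless_ssh .bhosts” from the directory where you run mpirun. Any host marked FAILED would still prompt for a password when mpirun tries to start a process there.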
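SOLUTION: The Password: prompt is the clue. Here the ch_p4 device starts the remote processes through ssh, and ssh is asking for a password. Set up passwordless (key-based) SSH logins:
- cd ~/.ssh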
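- ssh-keygen -t dsa -f id_dsa (press Enter at the passphrase prompts: the key needs an empty passphrase so ssh can log in without asking. Current OpenSSH releases disable or remove DSA support, so on those use -t ed25519 -f id_ed25519 and the matching file names.)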
- cat id_dsa.pub >> authorized_keys
- chmod 640 authorized_keys
- Open authorized_keys with your favorite text editor. The line you just appended (the first line, if the file was new) contains a key for you@your.awesome.cluster.
- Copy that line. Paste it once for each node in the cluster, and change the hostname in each copy to match the name of the node. For example, the first few lines of my authorized_keys file look like this (where “. . .” are pieces I’ve abridged for security reasons):
ssh-dss AAAAB3NzaC1kc3MAAACBAO6K5GKxrd2UO. . . b8R7y6RJCTDRDw6iOJK8xKSvnCX8= victor@my.awesome.cluster.edu
ssh-dss AAAAB3NzaC1kc3MAAACBAO6K5GKxrd2UO. . . b8R7y6RJCTDRDw6iOJK8xKSvnCX8= victor@node002.cluster.private
ssh-dss AAAAB3NzaC1kc3MAAACBAO6K5GKxrd2UO. . . b8R7y6RJCTDRDw6iOJK8xKSvnCX8= victor@node003.cluster.private
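. . . and now your MPI application should work, provided your home directory (and therefore ~/.ssh) is shared with the compute nodes, as it appears to be here. If it isn't, copy authorized_keys into ~/.ssh on each node.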
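If the machinefile already lists every node, the copy-and-edit step can be scripted. A minimal sketch, assuming id_dsa.pub holds a single line of the form “type base64 user@host” and that the machinefile (.bhosts here) lists one host per line, optionally as host:N. The name per_node_key_lines is hypothetical. It copies the key type and base64 blob verbatim and only rewrites the user@host comment at the end of each copy.

```shell
# Hypothetical helper: print one copy of a public key per host in an MPI
# machinefile, with the trailing comment rewritten to user@host.
# Assumes the key file holds a single line: "<type> <base64> <user>@<host>".
per_node_key_lines() {
  pubkey=$1
  machinefile=$2
  # Split the key line into type, base64 blob, and comment.
  read -r keytype blob comment < "$pubkey"
  user=${comment%%@*}   # "victor@my.awesome.cluster.edu" -> "victor"
  # Same host parsing: strip comments and ":N", skip blanks, de-duplicate.
  sed -e 's/#.*//' -e 's/:.*//' "$machinefile" | awk 'NF { print $1 }' | sort -u |
    while read -r host; do
      printf '%s %s %s@%s\n' "$keytype" "$blob" "$user" "$host"
    done
}
```

For example, “per_node_key_lines ~/.ssh/id_dsa.pub .bhosts >> ~/.ssh/authorized_keys” appends one line per node. If the head node is also in .bhosts you get a duplicate of the line added earlier, which does no harm.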
If you’re fixing this problem for someone else (assuming you have root privileges), run the steps above from inside that user’s .ssh directory, then do the following additional steps:
- All the keys you generate will be labeled root@my.awesome.cluster. In authorized_keys and id_dsa.pub, change root@my.awesome.cluster to someone.else@my.awesome.cluster, where someone.else is the appropriate username.
- All the key files you generate will be owned by root, which is not what we want. Run “chown USERNAME authorized_keys id_dsa*” to give them to the user.
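Both extra steps can be scripted. The sketch below assumes you run it as root from inside the user’s ~/.ssh directory and that every key line ends with a comment of the form root@host; relabel_keys and hand_over_keys are hypothetical names. It rewrites root to the given username on every key line, including any per-node copies, then changes ownership.

```shell
# Hypothetical helpers for the "fixing it for someone else" case.
# Assumes the current directory is the user's ~/.ssh and every key line
# ends with a comment of the form root@host.

# relabel_keys USER: change " root@host" at the end of each line to " USER@host".
relabel_keys() {
  user=$1
  for f in authorized_keys id_dsa.pub; do
    # Edit via a temp file, then copy back with cat so the original file keeps
    # its inode and permissions (e.g. 640); BSD and GNU "sed -i" also differ.
    sed 's/ root@\([^ ]*\)$/ '"$user"'@\1/' "$f" > "$f.tmp" &&
      cat "$f.tmp" > "$f" &&
      rm -f "$f.tmp"
  done
}

# hand_over_keys USER: relabel, then give the key files to USER (needs root).
hand_over_keys() {
  relabel_keys "$1" &&
    chown "$1" authorized_keys id_dsa id_dsa.pub
}
```

Usage: “hand_over_keys someone.else”. The temp-file-and-cat approach is used instead of sed -i because the -i flag takes different arguments on BSD and GNU sed.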