I can’t get a job to run for more than a few hours before it fails with:
SSHConnectionLibSSH::isConnected(): server timeout.
SSH error: Failed to resolve hostname legion.rc.ucl.ac.uk (Name or
service not known)
“Cannot connect to ssh server ucfbcoh@legion.rc.ucl.ac.uk:22”
Warning: “Cannot connect to ssh server”
SSHConnectionLibSSH::isConnected(): server timeout.
SSH error: Failed to resolve hostname legion.rc.ucl.ac.uk (Name or
service not known)
“Cannot connect to ssh server ucfbcoh@legion.rc.ucl.ac.uk:22”
Warning: “Cannot connect to ssh server”
Meanwhile we had ping running in another window, and it showed no
errors and no loss of network or name service.
I think the host just didn’t respond to the ssh call immediately, so
the call timed out and xtalopt then died.
I know how to fix this but have to find time.
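One way to check the guess above is to test name resolution and the TCP connection to the ssh port separately, since ping staying clean does not rule out a slow or failed lookup. The following is only a minimal Python sketch, not XtalOpt code; the function name classify_ssh_reachability and its resolve/connect parameters are hypothetical and exist only so the two steps can be swapped out for testing.

```python
import socket


def classify_ssh_reachability(host, port=22, timeout=5.0,
                              resolve=socket.getaddrinfo,
                              connect=socket.create_connection):
    """Return 'dns_failure', 'connect_failure', or 'ok' for host:port.

    resolve and connect are injectable (hypothetical hooks) so each
    step can be exercised without a real network.
    """
    # Step 1: name resolution only. A failure here corresponds to
    # "Name or service not known".
    try:
        resolve(host, port)
    except socket.gaierror:
        return "dns_failure"

    # Step 2: TCP connection to the ssh port, bounded by a timeout.
    # Timeouts and refused connections are both OSError subclasses.
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        return "connect_failure"

    conn.close()
    return "ok"
```

Running something like classify_ssh_reachability("legion.rc.ucl.ac.uk") in a loop next to ping would show whether the failures are in the lookup or in the ssh port itself.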
Ron
Ronald Cohen
Geophysical Laboratory
Carnegie Institution
5251 Broad Branch Rd., N.W.
Washington, D.C. 20015
rcohen@carnegiescience.edu
office: 202-478-8937
skype: ronaldcohen
https://twitter.com/recohen3
On Sun, Aug 16, 2015 at 4:35 PM, Patrick Avery psavery@buffalo.edu wrote:
Hey Ron,
So, we have been making several updates for a new release that is coming out
soon. We MIGHT have already fixed this issue (although I don’t recall
explicitly fixing it). I ran a test today to see what would happen. Let
me know if you think this test adequately mimics the glitch you found:
I submitted a couple of jobs with XtalOpt, then disconnected my wifi for
about 20 seconds (so the connection to the remote cluster would fail). Then,
I reconnected it, and it read the output from the runs and updated
successfully - no job restarts.
I tried it again for a longer period (I disconnected the wifi for
about 3 minutes). After several server timeouts (the message
Warning: “Cannot connect to ssh server” appeared three times in that
period), I reconnected the wifi. Unfortunately, the run did not continue;
it appeared to be frozen (something we may want to fix). After I exited
and resumed the run, it took a while, but the structures were updated
successfully from the output, with no job restarts.
Thanks,
Patrick
On Fri, Aug 14, 2015 at 4:35 PM, Cohen, Ronald rcohen@carnegiescience.edu
wrote:
I had fixed this in an earlier version but don’t remember how.
Sometimes the connection to the server or nameserver goes down (about
once a day) and I see an error like:
SSHConnectionLibSSH::isConnected(): server timeout.
SSH error: Failed to resolve hostname legion.rc.ucl.ac.uk (Name or
service not known)
“Cannot connect to ssh server ucfbcoh@legion.rc.ucl.ac.uk:22”
Warning: “Cannot connect to ssh server”
The jobs are still running on the server, and the connection does come
back, but in the meantime xtalopt hangs and never recovers without a
restart, which loses the running jobs. It should just wait until the
server connection comes back.
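The wait-until-reconnected behaviour requested here could look roughly like the loop below. This is a minimal Python sketch and not how XtalOpt is actually implemented; wait_for_connection, is_connected, and the delay values are all hypothetical names and numbers chosen for illustration. The point is that a failed check only postpones polling of the remote jobs instead of aborting the run.

```python
import time


def wait_for_connection(is_connected, initial_delay=5.0, max_delay=300.0,
                        sleep=time.sleep):
    """Block until is_connected() returns True.

    is_connected is any zero-argument callable (hypothetical here)
    that reports whether the ssh connection is usable. Returns the
    number of failed checks seen before the connection came back.
    """
    delay = initial_delay
    failures = 0
    while not is_connected():
        failures += 1
        sleep(delay)
        # Back off exponentially, capped so a long outage is still
        # rechecked at least every max_delay seconds.
        delay = min(delay * 2, max_delay)
    return failures
```

Jobs already submitted keep running on the cluster during the wait, so once the loop returns their output can be read as usual.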
Ron