XTALOPT hangs if server connection lost

I had fixed this in an earlier version but don’t remember how.
Sometimes the connection to the server or nameserver goes down (about
once a day) and I see an error like:

SSHConnectionLibSSH::isConnected(): server timeout.
SSH error: Failed to resolve hostname legion.rc.ucl.ac.uk (Name or service not known)
“Cannot connect to ssh server ucfbcoh@legion.rc.ucl.ac.uk:22”
Warning: “Cannot connect to ssh server”

The jobs keep running on the server, and the connection does come back,
but in the meantime XtalOpt hangs and never recovers. The only way out is
a restart, which loses the running jobs. It should just wait until the
server connection comes back.

Ron


Ronald Cohen
Geophysical Laboratory
Carnegie Institution
5251 Broad Branch Rd., N.W.
Washington, D.C. 20015
rcohen@carnegiescience.edu
office: 202-478-8937
skype: ronaldcohen
https://twitter.com/recohen3
https://www.linkedin.com/profile/view?id=163327727

If you’re having issues resolving hosts, there’s not much that can be done
about that from the XtalOpt side of things. Might be VPN related?

Dave

I can do it myself again. I guess my changes from some years ago didn’t stick. What is needed is a loop that tests whether the ssh call succeeded and, if not, sleeps for, say, 60 seconds before trying again. Then, even if a server or the network is down for hours, XtalOpt won’t skip a beat: jobs will complete and be tallied from the server as soon as the network comes back.
Sincerely,

Ron
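
The retry loop described above could look roughly like the sketch below. It is written in Python for brevity and is not taken from XtalOpt (which is C++); wait_for_connection, ssh_reachable, and their parameters are hypothetical names chosen for illustration. BatchMode and ConnectTimeout are standard OpenSSH client options.

```python
import subprocess
import time
from typing import Callable, Optional


def ssh_reachable(host: str, connect_timeout_s: int = 10) -> bool:
    """Return True if a non-interactive `ssh host true` succeeds.

    Hypothetical helper: it shells out to the OpenSSH client and treats
    any failure (timeout, DNS error, missing ssh binary) as "not reachable".
    """
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes",
             "-o", f"ConnectTimeout={connect_timeout_s}", host, "true"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=connect_timeout_s + 5,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def wait_for_connection(
    is_connected: Callable[[], bool],
    retry_interval_s: float = 60.0,
    max_wait_s: Optional[float] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Block until is_connected() returns True, sleeping between attempts.

    Returns True once the check succeeds, or False if max_wait_s of sleep
    time has accumulated first. With max_wait_s=None it waits indefinitely.
    """
    waited = 0.0
    while not is_connected():
        if max_wait_s is not None and waited >= max_wait_s:
            return False
        sleep(retry_interval_s)
        waited += retry_interval_s
    return True
```

A caller would wrap each remote operation with something like wait_for_connection(lambda: ssh_reachable("user@host")) before polling the queue, so that an outage delays the update instead of aborting it.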

No, it is an XtalOpt error. No other application has a problem, and
neither does ssh from the command line. I suspect the call is timing out
(after a very short time) and is not being retried.

Ron


I switched to the CLI ssh backend and so far no problems! I guess the problem is in the libssh backend.
Ron

Unfortunately the CLI is not correct! I get:

Warning: “Optimizer::Update: Error loading structure at
/home/ucfbcoh/XTALOPT/C12H12/200GPa//00003x00051/”

and when I look at that directory I find on the client:

ls /home/ucfbcoh/XTALOPT/C12H12/200GPa//00003x00051/
00003x00051 job.sh structure.state xtal.in
ucfbcoh@tomcat:~/src/xtalopt2/build$ ls /home/ucfbcoh/XTALOPT/C12H12/200GPa//00003x00051/00003x00051/
job.sh pwscf.save xtal.in xtal.out

I had also fixed this long ago, but it is back again. The files copied
back from the server are being stored one directory level too deep.

Ron
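
As a stopgap on the client side, the misplaced results could be pulled up one level with a small script. The sketch below is only an illustration under the assumption that the layout is exactly as in the listing above (a subdirectory with the same name as its parent); flatten_nested_copy is a hypothetical name, and this is not how XtalOpt itself handles the copy.

```python
import shutil
from pathlib import Path
from typing import List


def flatten_nested_copy(structure_dir: Path) -> List[str]:
    """Move files out of structure_dir/<same name>/ into structure_dir.

    Entries that already exist one level up (for example job.sh) are left
    where they are, so nothing is overwritten. Returns the names moved.
    """
    nested = structure_dir / structure_dir.name
    if not nested.is_dir():
        return []  # layout is already correct, nothing to do

    moved = []
    for item in sorted(nested.iterdir()):
        target = structure_dir / item.name
        if target.exists():
            continue  # keep the existing copy one level up
        shutil.move(str(item), str(target))
        moved.append(item.name)

    # Remove the nested directory only if it ended up empty.
    if not any(nested.iterdir()):
        nested.rmdir()
    return moved
```

With the directory contents shown above, this would move pwscf.save and xtal.out up one level and leave the duplicate job.sh and xtal.in in the nested directory.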


I think the attached code I edited fixes this problem.

Ron

Hi Ron,

We did actually fix this issue for the new update that is coming soon. Let
me know if you have any issues with your fix and I can give you the updated
version of that file.

Thanks,
Patrick

Hi Ron,

I’m not sure if this error is queue-independent or not. What queueing
system are you using?

I can’t get a job to run for more than a few hours before it fails with:

SSHConnectionLibSSH::isConnected(): server timeout.

SSH error: Failed to resolve hostname legion.rc.ucl.ac.uk (Name or
service not known)

“Cannot connect to ssh server ucfbcoh@legion.rc.ucl.ac.uk:22”

Warning: “Cannot connect to ssh server”

SSHConnectionLibSSH::isConnected(): server timeout.

SSH error: Failed to resolve hostname legion.rc.ucl.ac.uk (Name or
service not known)

“Cannot connect to ssh server ucfbcoh@legion.rc.ucl.ac.uk:22”

Warning: “Cannot connect to ssh server”

Meanwhile we had ping running in another window, and it showed no
errors and no loss of network or name service.

I think the host just didn’t respond to the ssh call immediately, the
call timed out, and XtalOpt then died.
I know how to fix this but have to find the time.

Ron
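
To tell a name-resolution failure apart from a slow or refused ssh connection, a check along the following lines can be left running next to ping. This is a minimal sketch using only the Python standard library; diagnose_ssh_endpoint and its return strings are hypothetical, and it tests only DNS lookup and the TCP connect to the ssh port, not ssh authentication.

```python
import socket


def diagnose_ssh_endpoint(host: str, port: int = 22, timeout_s: float = 10.0) -> str:
    """Classify reachability of host:port.

    Returns "dns_failure" if the name cannot be resolved,
    "connect_failure" if it resolves but the TCP connect fails or
    times out, and "ok" if a TCP connection can be opened.
    """
    try:
        # Same lookup the ssh client needs before it can connect.
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"

    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "ok"
    except OSError:
        return "connect_failure"
```

Logging the result with a timestamp every minute or so would show whether the "Failed to resolve hostname" messages coincide with real DNS failures or only with timeouts inside the application.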


On Sun, Aug 16, 2015 at 4:35 PM, Patrick Avery psavery@buffalo.edu wrote:

Hey Ron,

So, we have been making several updates for a new release that is coming out
soon. We MIGHT have already fixed this issue (although I don’t recall
explicitly fixing it). But I ran a test today to see what would happen. Let
me know if you think this test adequately mimics your glitch that you found:

I submitted a couple of jobs with XtalOpt, then disconnected my wifi for
about 20 seconds (so the connection to the remote cluster would fail). Then,
I reconnected it, and it read the output from the runs and updated
successfully - no job restarts.

I tried it again for a longer period of time (I disconnected the wifi for
about 3 minutes). After several server timeouts (and it mentioned “Warning:
“Cannot connect to ssh server”” three times in that time period), I
reconnected the wifi. Unfortunately, the run did not continue; it appeared
to be frozen (something we may want to fix). After I exited and resumed
the run, it took a while, but it updated the structures successfully from
the output, with no job restarts.

Thanks,
Patrick

I tried building the ssh CLI interface and it builds but fails with:

ead is QThread(0x2aa3ca0)
avogadro: symbol lookup error:
/home/ucfbcoh/lib/avogadro/1_1/contrib/xtalopt.so: undefined symbol:
_ZN9OpenBabel12OBConversion11SetInFormatEPNS_8OBFormatE

By the way, all the Qt errors make it very hard to find the real errors.

Sorry. While the IT group tried to track down this problem, they
reinstalled or deleted some library, and now XtalOpt fails all the
time. I am getting on a plane and will have to try to fix it later. Sorry
to bother you. Ron

OK, I rebuilt everything and am trying the ssh CLI. Sorry for the noise
(flight delay!) Ron
