
XTALOPT hangs if server connection lost


#1

I had fixed this in an earlier version but don’t remember how.
Sometimes the connection to the server or nameserver goes down (about
once a day) and I see an error like:

SSHConnectionLibSSH::isConnected(): server timeout.
SSH error: Failed to resolve hostname legion.rc.ucl.ac.uk (Name or
service not known)
"Cannot connect to ssh server [email protected]:22"
Warning: “Cannot connect to ssh server”

However, the jobs are still running on the server, and the connection
does come back. In the meantime, though, xtalopt hangs and never recovers
without a restart, which loses the running jobs. It should just wait until
the server connection comes back.

Ron


Ronald Cohen
Geophysical Laboratory
Carnegie Institution
5251 Broad Branch Rd., N.W.
Washington, D.C. 20015
[email protected]
office: 202-478-8937
skype: ronaldcohen
https://twitter.com/recohen3
https://www.linkedin.com/profile/view?id=163327727


#2

On Fri, Aug 14, 2015 at 4:35 PM, Cohen, Ronald [email protected]
wrote:

I had fixed this in an earlier version but don’t remember how.
Sometimes the connection to the server or nameserver goes down (about
once a day) and I see an error like:

SSHConnectionLibSSH::isConnected(): server timeout.
SSH error: Failed to resolve hostname legion.rc.ucl.ac.uk (Name or
service not known)
"Cannot connect to ssh server [email protected]:22"
Warning: “Cannot connect to ssh server”

If you’re having issues resolving hosts, there’s not much that can be done
about that from the XtalOpt side of things. Might be VPN related?

Dave


#3

I can do it myself again… I guess my changes from some years ago didn’t stick. What is needed is a loop that tests whether the ssh call succeeded and, if it did not, sleeps for, say, 60 seconds before trying again. Then even if a server or network is down for hours, xtalopt won’t skip a beat, and jobs will complete and be tallied from the server as soon as the network comes back.
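
A minimal sketch of that retry loop, written in Python for brevity rather than in XtalOpt's own C++, and not taken from any XtalOpt source. The names `wait_for_connection`, `is_connected`, and `retry_delay_s` are hypothetical; `is_connected` stands in for whatever call reports that the ssh test succeeded.

```python
import time
from typing import Callable


def wait_for_connection(
    is_connected: Callable[[], bool],
    retry_delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until is_connected() returns True.

    Returns the number of failed attempts seen before the connection
    came back. Nothing is cancelled or restarted while waiting.
    """
    failed_attempts = 0
    while not is_connected():
        failed_attempts += 1
        # The jobs on the server keep running; just wait and test again.
        sleep(retry_delay_s)
    return failed_attempts


if __name__ == "__main__":
    # Simulated outage: two failed checks, then the link is back.
    responses = iter([False, False, True])
    attempts = wait_for_connection(lambda: next(responses), sleep=lambda s: None)
    print(f"reconnected after {attempts} failed attempts")
```

The loop has no upper limit on purpose, matching the idea that an outage of several hours should only delay the tally, not end the run. A real implementation would probably also want a way to cancel the wait from the GUI.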
Sincerely,

Ron


On Aug 17, 2015, at 9:41 AM, David Lonie [email protected] wrote:

If you’re having issues resolving hosts, there’s not much that can be done about that from the XtalOpt side of things. Might be VPN related?

Dave


Avogadro-Discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/avogadro-discuss


#4

No, it is a xtalopt error. There is no problem for any other
application or the command line. I suspect the call is timing out (in
a very short amount of time) and not trying again.

Ron


#5

I switched to the CLI ssh and so far no problems! I guess it is a problem with libssh.
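
For anyone who wants to check the command-line route by hand, here is a rough Python sketch of a connectivity probe that shells out to the OpenSSH client. It is not how XtalOpt drives the CLI ssh; `build_ssh_probe` and `ssh_is_reachable` are hypothetical helper names, and the user and host passed in are placeholders. `BatchMode` and `ConnectTimeout` are standard OpenSSH client options.

```python
import subprocess
from typing import List


def build_ssh_probe(host: str, user: str, port: int = 22,
                    connect_timeout_s: int = 30) -> List[str]:
    """Build an ssh command that logs in, runs `true`, and exits."""
    return [
        "ssh",
        "-p", str(port),
        "-o", "BatchMode=yes",  # never stop to ask for a password
        "-o", f"ConnectTimeout={connect_timeout_s}",
        f"{user}@{host}",
        "true",
    ]


def ssh_is_reachable(cmd: List[str], overall_timeout_s: float = 60.0) -> bool:
    """Return True only if the probe command exits with status 0."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=overall_timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or the ssh binary could not be started at all.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Placeholder account and host; substitute your own.
    probe = build_ssh_probe("cluster.example.org", "someuser")
    print(" ".join(probe), "->", ssh_is_reachable(probe))
```

A generous `ConnectTimeout` matters here, since a server that is slow to answer should count as "not yet", not as a hard failure.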
Ron


#6

Unfortunately the CLI is not correct! I get:

Warning: “Optimizer::Update: Error loading structure at /home/ucfbcoh/XTALOPT/C12H12/200GPa//00003x00051/”

and when I look at that directory I find on the client:

ls /home/ucfbcoh/XTALOPT/C12H12/200GPa//00003x00051/
00003x00051 job.sh structure.state xtal.in
[email protected]:~/src/xtalopt2/build$ ls /home/ucfbcoh/XTALOPT/C12H12/200GPa//00003x00051/00003x00051/
job.sh pwscf.save xtal.in xtal.out

I had fixed this also long ago, but it is back again. It is storing
one directory down.
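
As a stopgap until the copy step itself is fixed, the listing above suggests a workaround: if a structure directory contains a subdirectory with its own name, move that subdirectory's contents up one level. The Python sketch below is only that workaround, not the fix in XtalOpt; `flatten_nested_copy` is a hypothetical name. It also assumes that the copies inside the nested directory are the newer ones and may overwrite same-named local files such as `job.sh` and `xtal.in`.

```python
import shutil
import tempfile
from pathlib import Path


def flatten_nested_copy(structure_dir: Path) -> bool:
    """Move the entries of <dir>/<dir.name>/ up into <dir>/.

    Returns True if a nested directory was found and flattened.
    Assumption: nested files replace same-named files one level up.
    """
    nested = structure_dir / structure_dir.name
    if not nested.is_dir():
        return False
    if (nested / structure_dir.name).exists():
        # Moving this entry up would collide with `nested` itself.
        raise ValueError(f"refusing to flatten doubly nested {nested}")
    # Snapshot the listing first; the directory changes as we move things.
    for entry in list(nested.iterdir()):
        target = structure_dir / entry.name
        if target.is_dir():
            shutil.rmtree(target)
        elif target.exists():
            target.unlink()
        shutil.move(str(entry), str(target))
    nested.rmdir()
    return True


if __name__ == "__main__":
    # Rebuild the layout from the listing in a scratch directory.
    with tempfile.TemporaryDirectory() as tmp:
        d = Path(tmp) / "00003x00051"
        (d / "00003x00051" / "pwscf.save").mkdir(parents=True)
        for name in ("job.sh", "structure.state", "xtal.in"):
            (d / name).write_text("local")
        for name in ("job.sh", "xtal.in", "xtal.out"):
            (d / "00003x00051" / name).write_text("from server")
        flatten_nested_copy(d)
        print(sorted(p.name for p in d.iterdir()))
```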

Ron


On Fri, Aug 21, 2015 at 1:48 AM, Ronald Cohen
[email protected] wrote:

I switched to the CLI ssh and so far no problems! I guess it is a problem
with libssh.
Ron
Sent from my iPad


#7

I think the attached code I edited fixes this problem.

Ron

On Fri, Aug 21, 2015 at 12:08 PM, Cohen, Ronald
[email protected] wrote:

Unfortunately the CLI is not correct! I get:

Warning: “Optimizer::Update: Error loading structure at /home/ucfbcoh/XTALOPT/C12H12/200GPa//00003x00051/”

and when I look at that directory I find on the client:

ls /home/ucfbcoh/XTALOPT/C12H12/200GPa//00003x00051/
00003x00051 job.sh structure.state xtal.in
[email protected]:~/src/xtalopt2/build$ ls /home/ucfbcoh/XTALOPT/C12H12/200GPa//00003x00051/00003x00051/
job.sh pwscf.save xtal.in xtal.out

I had fixed this also long ago, but it is back again. It is storing
one directory down.

Ron


#8

Hi Ron,

We did actually fix this issue for the new update that is coming soon. Let
me know if you have any issues with your fix and I can give you the updated
version of that file.

Thanks,
Patrick

On Friday, August 21, 2015, Cohen, Ronald [email protected]
wrote:

I think the attached code I edited fixes this problem.

Ron


#9

Hi Ron,

I’m not sure if this error is queue-independent or not. What queueing
system are you using?


#10

I can’t get a job to run for more than a few hours before it fails with:

SSHConnectionLibSSH::isConnected(): server timeout.

SSH error: Failed to resolve hostname legion.rc.ucl.ac.uk (Name or
service not known)

“Cannot connect to ssh server [email protected]:22”

Warning: “Cannot connect to ssh server”

SSHConnectionLibSSH::isConnected(): server timeout.

SSH error: Failed to resolve hostname legion.rc.ucl.ac.uk (Name or
service not known)

“Cannot connect to ssh server [email protected]:22”

Warning: “Cannot connect to ssh server”

Meanwhile we had ping running in another window, and it showed no
errors and no loss of network or name service.

I think the host just didn’t respond to the ssh call immediately, the
call timed out, and xtalopt then died.
I know how to fix this but have to find time.
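
One way to narrow this down is to test the two stages separately: name resolution first, then a plain TCP connection to the ssh port. The Python sketch below does that using only the standard `socket` module. `diagnose_ssh_endpoint` is a hypothetical helper, and the `resolve` and `connect` parameters exist only so that the stages can be swapped out in a test.

```python
import socket
from typing import Any, Callable


def diagnose_ssh_endpoint(
    host: str,
    port: int = 22,
    timeout_s: float = 10.0,
    resolve: Callable[..., Any] = socket.getaddrinfo,
    connect: Callable[..., Any] = socket.create_connection,
) -> str:
    """Return "dns-failure", "connect-failure", or "ok"."""
    try:
        resolve(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # Covers errors such as "Name or service not known".
        return "dns-failure"
    try:
        conn = connect((host, port), timeout=timeout_s)
    except OSError:
        # Refused, unreachable, or timed out at the TCP level.
        return "connect-failure"
    conn.close()
    return "ok"


if __name__ == "__main__":
    print(diagnose_ssh_endpoint("legion.rc.ucl.ac.uk"))
```

Logging the result next to each "server timeout" message would show whether the resolver or the ssh port is the part that fails while ping stays clean.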

Ron


On Sun, Aug 16, 2015 at 4:35 PM, Patrick Avery [email protected] wrote:

Hey Ron,

So, we have been making several updates for a new release that is coming out
soon. We MIGHT have already fixed this issue (although I don’t recall
explicitly fixing it). But I ran a test today to see what would happen. Let
me know if you think this test adequately mimics your glitch that you found:

I submitted a couple of jobs with XtalOpt, then disconnected my wifi for
about 20 seconds (so the connection to the remote cluster would fail). Then,
I reconnected it, and it read the output from the runs and updated
successfully - no job restarts.

I tried it again for a longer period of time (I disconnected the wifi for
about 3 minutes). After several server timeouts (the message Warning:
“Cannot connect to ssh server” appeared three times in that period), I
reconnected the wifi. Unfortunately, the run did not continue; it appeared
to be frozen (something we may want to fix). After exiting and resuming
the run, though, it took a while but updated the structures
successfully from the output - no job restarts.

Thanks,
Patrick


#11

I tried building the ssh CLI interface. It builds, but at run time it fails with:

ead is QThread(0x2aa3ca0)
avogadro: symbol lookup error: /home/ucfbcoh/lib/avogadro/1_1/contrib/xtalopt.so: undefined symbol: _ZN9OpenBabel12OBConversion11SetInFormatEPNS_8OBFormatE

By the way, all the Qt errors make it very hard to find the real errors.


#12

Sorry, while the IT group tried to track down this problem they
reinstalled or deleted some library, and now xtalopt fails all the
time. I am getting on a plane and will have to try to fix it later.
Sorry to bother you. Ron


#13

OK, rebuilt everything and am now trying the ssh CLI. Sorry for the noise
(flight delay!) Ron


#14

Hello

I had fixed this in an earlier version but don’t remember how.