Higher precision for exported XYZ files

I noticed that while CJSON files are exported with coordinates to 14–16 decimal places, XYZ files only get 5.

In comparison ORCA and xtb both write 14 to file. Can Avogadro match that?

Why do you need so many decimal places? The positions of the atoms are certainly not accurate to 1.0e-14 Å. Maybe, just maybe, the convergence with a good quantum program is accurate to ~1.0e-5 Å.

CJSON simply uses the default precision for JSON export. I don't believe any current JSON library lets you set the precision, or I'd do that to save space (e.g., "Is there a way to control the precision of serialized floating point numbers?" · Issue #677 · nlohmann/json · GitHub).
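For what it's worth, you can see the same behaviour with Python's json module (a rough illustration only, not Avogadro's actual C++ code path): like most serializers, it writes each double with enough digits to round-trip the value exactly and offers no precision knob.

import json

x = 1.2345678901234567                     # a typical double-precision coordinate
s = json.dumps({"coords": [x]})
print(s)                                   # the full ~17 significant digits appear
assert json.loads(s)["coords"][0] == x     # and the value round-trips exactly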

Can you find me an example in which more than 0.00001 Å accuracy is required?

Obviously that ridiculous level of precision is meaningless in a final result. But that’s not the point.

It’s primarily about avoiding rounding errors. Rounding shouldn’t be done until the end. Avogadro can’t and shouldn’t assume that exporting to this or that format is the final step in the user’s workflow and therefore is the appropriate time to do rounding.

If anything, I think producing an XYZ is most likely to be the first step of a workflow not the last.

As a single example: if I start an xtb optimization of a crude structure from an XYZ file written with either 5 or 14 decimal places, the resulting optimized geometries differ in many of the coordinates in the fifth decimal place. In several cases the difference is greater than 0.00003 Å, so the original rounding error is already compounding. This has consequences for reproducing results.
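To put a number on that initial error, before any compounding, here is a tiny sketch (with made-up coordinates) of what rounding to 5 decimal places throws away:

coords = [1.23456789012345, -0.98765432109876, 2.00000049999999]

# Writing with only 5 decimals introduces an error of up to 5e-6 Å per value
# before the optimization even starts.
rounded = [float(f"{c:.5f}") for c in coords]
errors = [abs(c - r) for c, r in zip(coords, rounded)]
print(max(errors))    # a few 1e-6 Å here, bounded by 5e-6, and it compounds downstream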

And multi-step calculations will only see further compounding. Sure, the user can use tighter convergence criteria, but the larger the differences in input structures are, the more likely it’ll end up changing something as important as which minimum is reached, right?
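As a toy illustration of the "which minimum" point (a made-up 1-D double-well potential, nothing to do with any real molecule), two starting points that differ only in the fifth decimal place can relax into different minima:

def minimise(x, steps=10000, lr=0.01):
    # Plain gradient descent on V(x) = (x^2 - 1)^2, with dV/dx = 4x(x^2 - 1)
    for _ in range(steps):
        x -= lr * 4.0 * x * (x * x - 1.0)
    return x

print(minimise(+0.00001))   # converges to ~ +1 (right-hand well)
print(minimise(-0.00001))   # converges to ~ -1 (left-hand well)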

Obviously super accurate geometries are not all that important, that’s why it’s broadly fine to use cheaper methods for optimizations and save the expensive methods for single points. But why risk it? Why withhold data that Avogadro has readily available?

It's also a matter of principle: Avogadro and other programs shouldn't throw information away on behalf of the user.

For example, data should be preserved across I/O cycles (to the extent that is viable). If a user loads an XYZ with 10 digit precision, changes a single atom’s position or adds a methyl group or whatever, then saves a new XYZ, why should the user lose information across that process?

Especially because it’s not even “loss” of information, it’s deliberate omission on Avogadro’s part.

What if a user runs a calculation to find a transition state, and the program they've used gives them the result in format (a) with high precision, but they then have to convert it to format (b) for use with some other program that doesn't accept (a)? Isn't it a bad thing that the user would then get a different result depending on whether they used Avogadro to do the conversion or some other tool?

Besides, XYZ files are tiny. ORCA and xtb both print the geometries with truncated precision within their text output, I guess because with so many geometries it makes a difference to the size, but for the saved output geometry they use the higher precision. For a single geometry in XYZ format we're talking about kilobytes, and users would be free to reduce the file size themselves at a later point if it were so crucial to them.

If you think 14 is just truly unnecessary, then something like 8 or 10 would at least mean it can be reasonably assumed that results from subsequent calculations are accurate to 5 significant figures.

If you think a calculation will actually suffer from coordinate accuracy worse than 0.00001 Å, I'd like to see a specific example. In most cases, QM programs will end a geometry optimization before converging to that point.

You seem to think that 10 digits actually carry useful information. I agree that information should not be lost. But I also don't think anything beyond ~5 decimal places in position is actually useful information vs. numerical noise.

Let me be clear. I’m a scientist and I’m absolutely willing to be convinced on this.

I just want data or an example in which there’s a difference.

I feel like you're misunderstanding me: I'm not trying to argue that the 10th or later digits are important information for us as chemists. I'm making the point that the result you get out of any function is different when the values of its parameters are different, and the effect compounds, so if you want two runs of the "same" calculation to give the "same" result, in the sense that they agree to e.g. 5 significant figures, you need to start with numbers that agree to much greater precision than that.

I figure I'll throw in my two cents here, since I am often concerned with positional precision in my research.

I ran a test calculation, albeit a basic one, using coordinate inputs with 14, 6, 5, or 4 decimal places.

My method for generating each input file was as follows:

Take the ORCA .xyz output from a previous calculation with 14 decimals and copy the coordinates exactly into an input file. For 6 decimals, I opened that same .xyz file in Avogadro and used the ORCA input generator, which gave an output with 6 decimals. From there, I used that input as a template and manually rounded off the coordinates to either 5 or 4 decimal places, making sure I didn't do any double rounding.
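The manual rounding for the 5- and 4-decimal inputs amounts to something like the following (a sketch with a made-up coordinate, not my actual procedure), always working from the original 14-decimal values so that nothing is rounded twice:

def round_coords(atoms, places):
    # atoms: list of (symbol, x, y, z) taken from the 14-decimal ORCA .xyz
    return [(sym, round(x, places), round(y, places), round(z, places))
            for sym, x, y, z in atoms]

original = [("O", 0.12345678901234, -1.98765432109876, 0.00000123456789)]
inputs = {places: round_coords(original, places) for places in (5, 4)}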

The ORCA method used is shown below:
!Opt wB97M-V def2-QZVPP LargePrint DEFGRID3 VeryTightSCF VeryTightOpt
%pal
nprocs 20
end

I have included a tar file here that has the original .xyz file, the inputs for each calculation, the final .xyz for each calculation, and a CSV that summarizes three components of each calculation: final SCF energy, final coordinates, and total time.

The conclusion that I have drawn, and I am sure the rest of us will draw the same, is that the deviation in the final output energy and geometry is directly related to the precision of the input geometry. See the graphs below for a brief summary of these results.


Some important things to note: I only ran 4 calculations, all on an already-optimized geometry, so these results come from a stable point on the potential energy surface and are not by any means extreme cases. I suspect that a less well-behaved region of the PES would drastically alter the results, giving the input precision much greater weight in the final geometry.

From these results, I suggest the same thing that @matterhorn103 does: that coordinates exported from Avogadro be printed with 10 decimals, which should greatly reduce the probability that the outcome of a calculation is significantly altered by the precision of the input coordinates.

You're expecting agreement on the order of 1.0e-9 Hartree. Let's turn that into kcal/mol (627.509 kcal/mol per Hartree). So that's an accuracy of … ~6.275090e-7 kcal/mol?

You have way, way bigger errors in your calculation, even at VeryTightSCF, etc.

Try switching processors. Or, since you probably use the ORCA binaries, try compiling a QM code from source and switching between different compilers.

Floating point errors < 1.0e-9 are close to numerical noise. Most QM programs avoid some integrals at that threshold. Most programs will tell you that two floating point numbers are identical at that precision.

And then you're indicating that there are deviations in the output geometry of ~5e-6 Å because of a roundoff to five decimals in the input coordinates (i.e., the deviation you report sits exactly at the precision limit of the XYZ file).

As I said, I'm willing to take a patch or see data showing that there's really a meaningful difference because of the decimal precision in the XYZ files.

But I think your data is basically saying what I am: that there are insignificant differences at that precision.

I feel like the thing to take away from this is not that the deviations are insignificant, but rather that they exist at all. If you have an unstable location on the PES then these changes are amplified, and furthermore, the cost associated with changing XYZ files from 5 decimals to 10 is extremely low.

See the files I've attached. An 8500-atom XYZ file with 5 decimals is 314,612 bytes (coordinatetest.xyz), versus 476,006 bytes for the same file with 10 decimals (coordinatetest2.xyz). That is only a ~1.5x increase in file size for doubling the decimal precision. Furthermore, if the XYZ writer used tab characters instead of runs of spaces, the file sizes would be cut down even further, to 239,231 bytes for 5 decimals (coordinatetest3.xyz) and 366,731 bytes for 10 decimals (coordinatetest4.xyz).

coordinatetest.xyz (307.2 KB)
coordinatetest2.xyz (464.8 KB)
coordinatetest3.xyz (233.6 KB)
coordinatetest4.xyz (358.1 KB)
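If anyone wants to reproduce the comparison, something along these lines (random coordinates, an arbitrary column separator, and the same atom count) gives file sizes in the same ballpark:

import random

def xyz_size(n_atoms, decimals, sep="    "):
    # Build an XYZ file in memory with random coordinates and count its bytes.
    lines = [str(n_atoms), "file size test"]
    for _ in range(n_atoms):
        x, y, z = (random.uniform(-50.0, 50.0) for _ in range(3))
        lines.append("C" + sep + sep.join(f"{c:.{decimals}f}" for c in (x, y, z)))
    return len("\n".join(lines).encode())

for decimals in (5, 10):
    print(decimals, "decimals:", xyz_size(8500, decimals), "bytes")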

I think my question is: even if the changes might not be chemically significant (though I still think that for very subtle structures they could be extremely significant), what is the downside?

Ok, I’ve run a bunch of short calculations on a variety of test structures to try and persuade you that Avogadro’s behaviour might make a difference.

What I did was draw unoptimized molecules by hand in Avogadro, as you might when preparing a file for a calculation. I saved a single geometry file for each test molecule, containing coordinates to 14+ decimal places, and loaded it in Python. I then rounded all the coordinates to the specified precision, which I've set at 5 for now since that's the status quo and the level we're discussing.

To then simulate what difference variation at that level of precision can make, I generated 100 sets of coordinates in which each coordinate had a 50% chance of being increased by 0.00001 Å relative to the original.

The average rounding error introduced by rounding to 5 decimal places is not generally as high as 1e-5 of course, but it’s of the same order of magnitude and this exercise serves to demonstrate the impact it can have.

I then ran an opt+freq with xtb on all 100 slightly different structures, per test molecule, analysed the results with polars, and compared the variation in various values in the output.
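The perturbation and the run loop were essentially the following (paths, the element handling, and the exact xtb invocation here are illustrative rather than my actual script):

import random
import subprocess
from pathlib import Path

def perturb(coords, places=5, delta=1e-5):
    # Round to the status-quo precision, then bump each coordinate up by
    # delta with 50% probability.
    return [[round(c, places) + (delta if random.random() < 0.5 else 0.0)
             for c in atom] for atom in coords]

def write_xyz(path, symbols, coords, comment=""):
    lines = [str(len(symbols)), comment]
    lines += [f"{s} {x:.14f} {y:.14f} {z:.14f}"
              for s, (x, y, z) in zip(symbols, coords)]
    Path(path).write_text("\n".join(lines) + "\n")

# For each test molecule: 100 slightly different inputs, each run through xtb, e.g.
# for i in range(100):
#     run_dir = Path(f"run_{i:03d}"); run_dir.mkdir(exist_ok=True)
#     write_xyz(run_dir / "input.xyz", symbols, perturb(coords))
#     subprocess.run(["xtb", "input.xyz", "--ohess"], cwd=run_dir)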

For something like glycerol the variation is not all that huge:

glycerol
┌────────────┬───────────┬───────────┐
│ variable   ┆ std       ┆ range     │
╞════════════╪═══════════╪═══════════╡
│ energy     ┆ 1.7758e-9 ┆ 8.1450e-9 │
│ coord_mean ┆ 0.000003  ┆ 0.000012  │
│ coord_max  ┆ 0.000006  ┆ 0.000023  │
│ freq_mean  ┆ 0.001882  ┆ 0.008211  │
│ freq_max   ┆ 0.00884   ┆ 0.0401    │
└────────────┴───────────┴───────────┘

To explain what these numbers actually mean: I collated the values for the final energy, all of the individual x, y, and z coordinates, and all of the individual frequencies. I then looked at the standard deviation (std) and the range of the results for each individual value. Because there are tens or hundreds of different coordinates, and tens of frequencies, I then found the average std and average range, and the maximum of each, across all coordinates and across all frequencies.

So what the table above is saying is that across 100 runs of a glycerol opt, the standard deviation of each coordinate value was on average 3e-6, and in the worst case 6e-6, while the largest variation in any of the coordinate values was a range of 2.3e-5. The frequencies also only vary by fractions of wavenumbers, and the energies always come out within an 8e-9 hartree range.
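For reference, the aggregation behind these tables is roughly the following (column names are made up; runs is a polars DataFrame with one row per run and one column per collected value):

import polars as pl

def summarise(runs: pl.DataFrame, prefix: str):
    # Per-column standard deviation and range, then the mean and max of each
    # across all columns sharing the prefix (e.g. "coord_" or "freq_").
    cols = [c for c in runs.columns if c.startswith(prefix)]
    stds = [runs[c].std() for c in cols]
    ranges = [runs[c].max() - runs[c].min() for c in cols]
    return {"mean_std": sum(stds) / len(stds), "max_std": max(stds),
            "mean_range": sum(ranges) / len(ranges), "max_range": max(ranges)}

# e.g. summarise(runs, "coord_") gives the numbers in the coord_mean and coord_max rows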

Benzene also unsurprisingly looks to basically withstand this degree of variation:

┌────────────┬───────────┬───────────┐
│ variable   ┆ std       ┆ range     │
╞════════════╪═══════════╪═══════════╡
│ energy     ┆ 1.4625e-9 ┆ 7.7370e-9 │
│ coord_mean ┆ 0.000002  ┆ 0.000011  │
│ coord_max  ┆ 0.000004  ┆ 0.000019  │
│ freq_mean  ┆ 0.029572  ┆ 0.21317   │
│ freq_max   ┆ 0.3272    ┆ 2.3171    │
└────────────┴───────────┴───────────┘

Though the variation in the frequencies is more significant here, I guess because of the role of symmetry?

Unfortunately if we look at something larger the compounding of the errors worsens:

aspartic_acid
┌────────────┬──────────┬──────────┐
│ variable   ┆ std      ┆ range    │
╞════════════╪══════════╪══════════╡
│ energy     ┆ 0.000004 ┆ 0.000017 │
│ coord_mean ┆ 0.028839 ┆ 0.103577 │
│ coord_max  ┆ 0.082202 ┆ 0.268131 │
│ freq_mean  ┆ 0.784611 ┆ 3.892848 │
│ freq_max   ┆ 2.97416  ┆ 17.4138  │
└────────────┴──────────┴──────────┘

Here we even have a variation in the single point energy that is approaching the sort of magnitude where it starts to make a difference (though it’s still below the level of being meaningful).

And now three examples that don’t optimize to a minimum but rather a saddle point of some degree (everything previously did optimize to a minimum):

acetic_acid
┌────────────┬───────────┬──────────┐
│ variable   ┆ std       ┆ range    │
╞════════════╪═══════════╪══════════╡
│ energy     ┆ 4.4624e-7 ┆ 0.000002 │
│ coord_mean ┆ 0.000457  ┆ 0.002148 │
│ coord_max  ┆ 0.001112  ┆ 0.005523 │
│ freq_mean  ┆ 0.190551  ┆ 0.868044 │
│ freq_max   ┆ 0.759079  ┆ 3.426    │
└────────────┴───────────┴──────────┘
acetone
┌────────────┬───────────┬───────────┐
│ variable   ┆ std       ┆ range     │
╞════════════╪═══════════╪═══════════╡
│ energy     ┆ 2.2648e-8 ┆ 8.3869e-8 │
│ coord_mean ┆ 0.000027  ┆ 0.000103  │
│ coord_max  ┆ 0.000077  ┆ 0.000294  │
│ freq_mean  ┆ 0.138108  ┆ 0.966263  │
│ freq_max   ┆ 1.521546  ┆ 10.8453   │
└────────────┴───────────┴───────────┘
dimethylurea_dimer
┌────────────┬───────────┬──────────┐
│ variable   ┆ std       ┆ range    │
╞════════════╪═══════════╪══════════╡
│ energy     ┆ 6.4671e-7 ┆ 0.000004 │
│ coord_mean ┆ 0.001024  ┆ 0.00979  │
│ coord_max  ┆ 0.006106  ┆ 0.05883  │
│ freq_mean  ┆ 0.096295  ┆ 0.812409 │
│ freq_max   ┆ 1.594939  ┆ 11.8998  │
└────────────┴───────────┴──────────┘

What you can see is that sometimes the input geometry doesn’t make all that much difference to the geometry of the end result, or to its properties, and sometimes it really does.

My aspartic acid shows 0.03 Å variation in the atomic positions of the output just due to randomized 1e-5 differences in the initial positions! And this is despite the fact that it reaches a minimum of the PES in all cases, so it is always, in theory, a "successful" optimization.

And the most extreme example:

ethylene_glycol_diacetate
┌────────────┬───────────┬──────────┐
│ variable   ┆ std       ┆ range    │
╞════════════╪═══════════╪══════════╡
│ energy     ┆ 0.001118  ┆ 0.002418 │
│ coord_mean ┆ 0.046022  ┆ 0.11918  │
│ coord_max  ┆ 0.485238  ┆ 1.129222 │
│ freq_mean  ┆ 4.01064   ┆ 9.839013 │
│ freq_max   ┆ 83.425478 ┆ 182.9981 │
└────────────┴───────────┴──────────┘

Here we actually have an example where these tiny variations in input structure cause the qualitative result to vary, as the optimization sometimes lands at a minimum (69%) and sometimes only on a saddle point (31%)!

Obviously xtb is not very accurate. Obviously the use of tighter convergence criteria would have an effect, as would tighter grids, as would all the factors you mention like the compiler, the processor, etc. But none of them change the fact that what you put in directly affects what you get out, and sometimes significantly.

And these xtb structures would in most cases be used for a further optimization with DFT. The thing is, now the variation between the different input geometries is no longer only on the order of 1e-5 per coordinate, it's much higher. So the variation in the DFT results might well be even greater still, because the trajectory of the optimization is of course directly determined by the input geometry.

The set also contains only main-group compounds: nothing heavier than oxygen, no ions, no radicals, nothing hugely complicated. And there is a whole lot I didn't test, like NMR shifts and coupling constants, CD spectra, searches for conical intersections with TD-DFT, binding energies in supramolecular complexes, and non-covalent interactions, where I don't personally have any experience but can imagine they could be quite sensitive to small changes in a geometry.

I want to make clear that I’m not trying to argue that the result of a calculation is made any better or any more valid by increasing the precision of the input geometry, and I’m also not trying to argue that a difference of 1e-5 angstrom is important in a final result. I just want to make the case that someone who prepares a molecule in Avogadro, then exports it to file, then runs a calculation on that file, shouldn’t get different results based on whether they exported to XYZ or CJSON or SDF or PDB or whatever. And opening an XYZ file in Avogadro then re-exporting shouldn’t result in a huge loss of precision, because we can’t rule out that happening as an intermediate step in someone’s workflow.

For comparison, the results with rounding to, then varying at, 14 or 10 decimal places are attached.
results_14.txt (5.6 KB)
results_10.txt (5.8 KB)

I also ran each calculation 20 times with the exact same geometry and the results are identical each time, so xtb is otherwise deterministic on my system at least.

But SDF only supports four decimals and PDB only supports three decimals.
e.g.,
https://www.wwpdb.org/documentation/file-format-content/format32/sect9.html
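The coordinate fields in those formats are fixed-width (PDB ATOM records use an 8.3 field, the SDF/MOL V2000 atom block a 10.4 field), so the precision cap is baked into the format itself, whereas XYZ is free-format. A quick illustration:

x = 1.23456789012345

print(f"{x:8.3f}")     # PDB-style field  -> '   1.235'
print(f"{x:10.4f}")    # SDF-style field  -> '    1.2346'
print(f"{x:.14f}")     # XYZ is free-format, so any number of decimals fits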

Of course. That’s related to the optimization convergence criteria.

@brockdyer03 brought up energy differences … on the order of VeryTightSCF in ORCA = 1.0e-9 Hartree. Which, of course.

You’re arguing that there are small differences in the xtb optimization:
https://xtb-docs.readthedocs.io/en/latest/optimization.html#optimization-levels

What you are both arguing is that there are small numeric differences in calculations.

If we increased the floating point precision to, say, 8 decimal places, someone would come along and tell us that there are small numeric differences. Because there are small numeric differences in any sort of numerical calculation.
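Floating point arithmetic isn't even associative, so some level of disagreement is unavoidable no matter how many digits you print:

a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)    # 0.6000000000000001
print(a + (b + c))    # 0.6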

I’ll post my answer from Stack Exchange.

I have to say, as someone without any real stake in either side of this, I just don't see why not. I understand your point, @ghutchis, that the small to medium changes in the results of these calculations aren't really significant. However, I do not understand the reluctance to just change the output precision.

If you are concerned about precedent, that is understandable, but I can't think of any other reason why this change shouldn't be implemented. It certainly couldn't be file size concerns, since increasing the precision of the coordinates is such a minor change, and any program that is going to take in the XYZ coordinate is already going to convert it to either a 16- or 32-bit float and just have a bunch of trailing zeroes. There also isn't an existing standard for XYZ files; in fact, most programs that output them will print out all 14 decimals that are associated with 16-bit floats.

I'm just not sure why this change shouldn't be implemented, other than that the result of such a change is potentially insignificant and thus isn't worth the time. At any rate, the real reason for increasing the precision is to avoid losing data, as would happen if I loaded an ORCA XYZ output and it were then truncated on re-export.

Multiplied across the number of files times all the Avogadro users in existence.

You say it's an insignificant change, but you haven't run sampling across 3 million compounds or generated trajectories of thousands of frames of thousands of atoms.

And you might say “but then use a compressed format”. Yep. I want to do exactly that:

If it’s going to make y’all happy to increase the precision to say 8 decimals, please submit a pull request.

"XYZ coordinate is already going to convert it to either a 16- or 32-bit float and just have a bunch of trailing zeroes"

Do you know how floating point numbers are stored? Remember, they’re base-two.

So a 32-bit float has 24 bits of precision, i.e. ~7.2 decimal digits of accuracy. Yes, in principle a 64-bit double has ~15-16 decimal digits, but my point is that for atom positions, much of that is really not useful information for any practical calculation.
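A quick way to see those limits (just round-tripping a double through a 32-bit float with the standard library):

import struct

x = 0.12345678901234567                     # a coordinate stored as a 64-bit double

as_f32 = struct.unpack("f", struct.pack("f", x))[0]   # the same value squeezed into 32 bits
print(abs(as_f32 - x))                      # ~2e-9: only ~7 decimal digits survive
print(f"{x:.17g}")                          # the double itself carries ~16 significant digits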

Why should we be insisting that ~2M downloads should continue saving a pile of useless bits for eternity? As I mentioned, SDF and PDB have even lower accuracy … because in reality, atoms aren’t at fixed positions – they’re dynamic.

If you really, really feel strongly, fine: as I mentioned earlier, submit the pull request.