Heuristic for Reading CIF / Crystal Structures (v2)

I’m curious for feedback about import of crystal structures. Previous versions had a heuristic to show a filled unit cell vs. the primitive unit cell (e.g. Heuristic for Reading CIF / Crystal Structures? )

On the other hand, I can understand the confusion when a symmetric structure is loaded and only a few atoms show up: CIF files with space group other than P1 are not correctly loaded · Issue #1638 · OpenChemistry/avogadrolibs · GitHub

I can see that one previous suggestion was that loading a symmetric crystal structure would bring up a dialog about showing a primitive cell vs. a filled cell. I can also imagine bringing back the heuristic (e.g., if the primitive cell includes less than X carbon atoms, show the filled cell since it’s likely an inorganic crystal).

Thoughts? Suggestions? @thomas - how does CCDC Mercury handle this?

For me, a crystal is a cell and its content. I would note that many of the “options” displayed in the crystal menu actually make no sense if you don’t have the full unit cell:

  • Building a supercell
  • Generate XRD pattern
  • RDF
  • Symmetrize
  • and more generally all the space group menu operations

So if you load a CIF file (with symmetry) and use any of these options, they will either not do anything, or actually produce unphysical results (RDF for example) not matching the structure you’ve just loaded.

Mercury handles this by a clickable box to display either the primitive cell or the conventional cell. But it’s just a display option, and you can easily switch from one view to the other and back.

In summary: I would recommend either loading the full cell, or at least offering a clear choice (and disallowing the menu items that don’t make sense if the structure is incomplete).

That’s fair. The challenge is that there are a surprising number of users who open up a molecular CIF to run a quantum calculation and just want to see the primitive (molecular) representation.

Which is why we’ve tried to have some heuristic. IMHO it’s harder now, because I don’t know a good heuristic to differentiate:

  • a user opening up a CIF of ferrocene and initially wants just the molecular view
  • a user opening up a CIF of a MOF or COF and wants to see the full unit cell

While a user option is clearly necessary, I’m also aware that users expect something “smart.” I saw a lot of comments on Avogadro v1.2 loading CIF “wrong” because it showed the full packing for a molecule.

That’s why I suggested bringing back the “5 carbons” heuristic as the default, with “always primitive” and “always conventional” as options… but I would guess @fxcoudert and research group probably sees a lot of MOF / COF structures? Maybe a combination of “5 carbons” plus the space group?

I’m not sure what a good heuristic is, but… Even with a molecular crystal there is a chance that the asymmetric unit may not be the full molecule?

Rather than trying to choose one or the other automatically, would it be a terrible idea to have a pop up fire whenever a crystal structure is loaded asking the user which they’d prefer?

Possibly with an “Always use this choice in future” tick box if a user knows they will always want one or the other?

By default, CCDC’s Mercury displays the completed molecule, e.g.

Similar “completion” for instance for many entries in P2(1)/c (IT No. 14, a frequently observed space group for organic compounds) with its centre of inversion.

Only if one selects the option “asymmetric unit”, the display is constrained to the content of the atom block, e.g.,

which is somewhat similar to the default display on the page of COD (entry 711061)*

or the one on crystallography.io by Volodymyr Vreshch reading the same data.

Mercury considers filling a unit cell a special case of “packing” along the vectors (a, b, c) all equal 1 unit (calculate → packing/slicing). In addition to select the toggle on the main window (bottom left, just above “asymmetric unit”) one has to select (once) one of four options:

  • display of any atom which fits within the parallelepiped
  • display of any atom if its molecule’s centroid is in the box
  • display of any atom if at least one atom of its molecule is in the box
  • display of any atom if all atoms of is molecule fit into the box

* the underlying Jmol/JSmol console is accessible, so all commands (including packing, export of the structure/generation of a png for local storage) can be used there, too.

7110616.cif.tar.gz (3.7 KB)

Yes it is frequent. Asymmetric unit on a special position and symmetry operators in place allow a crystallographic model easier to establish (in crystallographic terms, to solve a structure, i.e. to assign each atom a place), and to manage (in crystallographic terms, to refine a structure, i.e., to adjust the atomic positions and occupancies while accounting the symmetry relationships present) with less risk of e.g., either forgetting an atom (-> incomplete model), or overfitting the model beyond reason (there are statistical tests for that, too). It is so important that there are programs/routines in place to routinely check and notify if one missed the presence of a higher symmetry in the data, too.

An other example here (in the display of the asymmetric unit).

Right. I’m curious because when I look this up in COD, the JSmol rendering looks like Avogadro - it’s only the asymmetric unit.

In this case, Mercury “knows” to complete the molecule, I guess because:
_chemical_formula_moiety 'C24 H21 N3 O3'

Since there are 8 carbon atoms in the asymmetric unit, I guess it goes through the symmetry to complete the declared chemical formula? (I’ll have to think of the best way to do that … I guess generating everything and only retaining the atoms closest to the centroid of the original set.)

Some sampling seems to indicate that _chemical_formula_moiety isn’t used consistently, so that particular heuristic seems to be that the asymmetric unit doesn’t match the _chemical_formula_sum.

It’s a good idea. It’s just that whenever crystallography comes up, I look to CCDC Mercury as the prototype because it “does the right thing.” (I also look to JSmol because I know Bob Hanson spent a lot of time on those features.)

I presume Jmol and JSmol share the commands the user can issue. If one reads the .cif locally (on the program’s console e.g., load "7110616.cif";)* the molecule is presented in the “symmetry completed” form:

7110616

The visualization by COD with JSmol is the asymmetric unit, because it runs the command load "7110616.cif" {1 1 1}; hide symmetry; instead.

Entry _chemical_formula_moiety in a cif file is a bit more complex. It reports with individual Hill formulae (charges permitted) discrete entities e.g., separate ions, solvent molecules (e.g., hydrates), “the other component” in accidental / intentional co-crystals. See e.g., COD 1501546 with the lines

_chemical_formula_moiety         'C9 H18 O2, C8 H8 N6'
_chemical_formula_sum            'C17 H26 N6 O2'

Here, though P-1 is a space group with centre of inversion, there is no “completion of molecules by symmetry”, each position in the asymmetric unit was defined in the atom block:

But it isn’t always present e.g., in COD 1010527 about \ce{CuSO4 * 5 H2O} reading

_chemical_formula_structural     'Cu S O4 (H2 O)5'
_chemical_formula_sum            'Cu H10 O9 S'

I speculate the maintainers of the COD might help to identify an implementation suitable for Avogadro (e.g., Andrius Merkys, Antanas Vaitkus). Their CLI tool codcif2sdf in cod-tools (which is packaged for DebiChem) has to tackle a similar task to join atoms back into a format “pure chemists” recognize easier. It works really well (which includes the assignment of bond orders) for plain organic structures. Its action is limited either by the chemistry (bond/bond orders in organometallics / inorganic compounds), or the structure model (models with positional, or occupational disorder are better checked/edited manually).


* With a ~/.bashrc amended by an

alias jm="java -jar ~/.../Jmol.jar -L"

the first time a structure file is load into Jmol only requires a jm 7110616.cif. Beside the tab completion by the shell, this retains access to a set of useful flags jm -h lists.

Okay, Bob Hanson wrote me back about the Jmol / JSmol approach.

Basically, if there are bond records in the CIF, it assumes it’s “molecule-ish” (potentially a network solid like diamond or a MOF / COF).

It then looks for atoms in neighboring unit cells to build up the molecular unit by brute-force.

That’s solvable, but it’ll take a bit of time to implement.

I have been using the CIF importing more and more lately, and have found some interesting things.

  1. Loading up a CIF file takes very little time typically.
  2. Using the “Fill Cell” function in the crystal menu works, however if the original molecule shows bonds between atoms, the bonds almost always get stretched out across the cell and in general are very messy.
  3. Building a supercell usually works but with occasional issues. In general a single CIF file will read and you can create a supercell no issue, but if you switch to a different molecule in that instance and then switch back to the supercell, you can cause a crash.
  4. The command to generate molecular properties under the analysis tab doesn’t always show the correct number of atoms, for example in the supercell, even though I had made a 4x4x4 cell, which should have up to 64 times as many atoms, the atom count still read at the same number as the single cell.
  5. On a more positive note, I am incredibly please with the function of the “Bond Atoms” function in the build menu. I was not expecting it to accurately draw bonds in a 5000+ atom supercell, but it performed extremely well.

In general, I am pleased with the reader thus far, the only issues I routinely encounter are the weird bonding when filling a unit cell, but I imagine that that issue could be fixed by just adding a bit of code to that which would remove the bonds and then use the “bond atoms” function.

The general issue with molecular crystals is that the code currently “wraps” all atoms into the filled unit cell before generating a supercell. Consequently, you get these bonds from an atom on one side of the cell … reaching across the cell to its bonded neighbor.

Obviously, I’m interested to hunt down any / all crashes. Can you give a bit more detail? What OS, what version, and what rendering options do you have active?