Atom, bond, residue custom properties

ghutchis · July 30, 2025, 1:00am

I previously discussed atom properties in CJSON, e.g.:

cjson["atoms"]["properties"] = {
   "fukui": [ 0.0, 1.0, 2.0, 3.0 … ],
   "polarizability": [ 0.1, 0.2, 0.3 … ]
}

Obviously, it would be nice to expand this to bonds and residues, since we already support arbitrary molecular properties.

Since the code is in C++ we need to think a bit about types of properties:

strings (e.g., labels)
numbers (e.g., partial charges, Fukui indices, micro-pKa, etc.)
booleans / flags
possibly vectors (atom dipoles) and matrices (e.g., full polarizability tensor)

One other caveat for the C++ implementation is that we may often want to only add properties to some atoms or bonds (e.g. “S” or “R” labels on chiral atoms). So we need a dictionary / map data structure between the atom or bond index and the property.
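As a rough illustration of the trade-off (sketched in Python rather than C++, with a hypothetical 5-atom molecule and made-up labels), the same chirality labels can be held either as a sparse index-to-value map or as a dense per-atom array:

```python
import json

# Hypothetical 5-atom molecule where only atoms 1 and 3 are stereocenters.
n_atoms = 5

# Sparse form: a map from atom index to label. JSON object keys are always
# strings, so the indices must be converted back to integers after parsing.
sparse_json = '{"1": "R", "3": "S"}'
sparse = {int(k): v for k, v in json.loads(sparse_json).items()}

# Dense form: one entry per atom, with an empty string where no label applies.
dense = [sparse.get(i, "") for i in range(n_atoms)]

print(sparse)  # {1: 'R', 3: 'S'}
print(dense)   # ['', 'R', '', 'S', '']
```

The map stores only what exists but needs a lookup per atom; the array is directly indexable but needs some value to stand in for "nothing here".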

This discussion was inspired in part by Greg Landrum (the RDKit developer) posting about partial charges in SD files. RDKit can store arbitrary properties: The RDKit Book — The RDKit 2025.03.3 documentation

Basically, RDKit keeps the atom / bond properties with keys like atom.dprop.DASH_CHARGES, where dprop indicates a float / double property, which helps with the C++ implementation. (This also makes JSON validation easier, e.g., list of strings, list of numbers, list of booleans.)

Personally, I’d prefer the CJSON syntax of storing ["atoms"]["properties"] (i.e., properties belonging to atoms) rather than as special keys in the molecular ["properties"] block.

I’m also inclined to keep partial charges separately because those are a fairly common set.

Thoughts? Critiques (esp. for access in CJSON or Python)? Better ideas?

matterhorn103 · July 30, 2025, 2:47pm

This seems to go a bit against the philosophy of CJSON till now, no? I’d have thought it’d be more in keeping with that to stick to arrays consistently, thus enabling fast lookup, even if that means some redundant values.

There are after all already some places where arrays are used over an approach with lower redundancy, such as storing the bond order for all bonds even though the majority could be assumed to be 1.

Specifically for chirality I’d have thought the best option would be an array of integers with 1 for R, -1 for S, and 0 for achiral.
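A minimal sketch of that integer encoding, assuming the 1 / -1 / 0 convention suggested here (the helper names are hypothetical and not part of CJSON or Avogadro):

```python
# Hypothetical encoding following the suggestion above:
# 1 = R, -1 = S, 0 = achiral (no label).
CHIRALITY_CODE = {"R": 1, "S": -1, "": 0}
CHIRALITY_LABEL = {code: label for label, code in CHIRALITY_CODE.items()}


def encode_chirality(labels):
    """Turn per-atom labels into a dense integer array."""
    return [CHIRALITY_CODE[label] for label in labels]


def decode_chirality(codes):
    """Turn the integer array back into per-atom labels."""
    return [CHIRALITY_LABEL[code] for code in codes]


labels = ["", "R", "", "S", ""]
codes = encode_chirality(labels)
print(codes)  # [0, 1, 0, -1, 0]
```

Every atom gets a value, so there is nothing missing to represent, at the cost of storing a 0 for each achiral atom.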

Naturally you’d want anything in the spec to be typed, but how hard is it to write type-agnostic C++ code for parsing the properties? I would have hoped that atoms.properties and bonds.properties could hold data of arbitrary types under arbitrary keys. Is arbitrary data under the current properties understood regardless of type?

Seems reasonable to me! This would be a bit of a significant change though, would you think about an accompanying version bump for the CJSON spec?

What do you mean by separately? I think a consistent approach would be preferable.

I would suggest that atoms.properties and bonds.properties be places where any property may be stored under any arbitrary string key, with the restriction that it must be an array of per-atom or per-bond values and may not be nested.
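The restriction proposed here could be checked along these lines. This is only a sketch of the rule as stated (string key, one flat array entry per atom or bond); the function name and the example keys are made up for illustration:

```python
def check_property_block(properties, count):
    """Return a list of problems found in a per-atom or per-bond properties block."""
    problems = []
    for key, values in properties.items():
        if not isinstance(key, str):
            problems.append(f"{key!r}: key is not a string")
        if not isinstance(values, list):
            problems.append(f"{key!r}: value is not an array")
            continue
        if len(values) != count:
            problems.append(f"{key!r}: expected {count} values, got {len(values)}")
        # Nested arrays or objects would need their own field in the spec.
        if any(isinstance(v, (list, dict)) for v in values):
            problems.append(f"{key!r}: nested values are not allowed")
    return problems


good = {"fukui": [0.0, 1.0, 2.0], "label": ["a", "b", "c"]}
bad = {
    "dipole": [[0.0, 0.1, 0.2], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    "short": [1.0],
}

print(check_property_block(good, 3))
for problem in check_property_block(bad, 3):
    print(problem)
```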

If a more complex structure than an unnested array is required to store the information then I think it should be at the top level e.g. under atoms and the structure of that field is then defined as part of the specification. Arbitrary keys under atoms shouldn’t be allowed.

ghutchis · July 30, 2025, 3:03pm

Yeah. The problem is that while it’s possible to have a missing / empty label with strings (e.g., “R”, “S”, or “”), and potentially with floating point (i.e., nan or null), I have no idea how to support missing values in boolean or integer arrays. A boolean is literally true or false, and there’s no “not an integer” value.

So if it’s an array (rather than a map), you have to have some way to represent missing values. Thus the choice - either properties are limited to types that can have missing values (strings, float, possibly vector / matrix) in an array, or they’re a map. Which is why I posted for discussion.

First off, it helps if they’re all the same type. Mixing strings and floating-point in an array would be painful. The reason for my previous suggestion is that boolean is represented as 1, 0, 1 in json, which is otherwise impossible to separate from integers. Similarly integers could easily be floats.

So if we restrict properties to strings and floats (numbers), then we’re good. Otherwise, I think we need a type hint somehow.

Not really. At the moment, the CJSON spec is incomplete. It doesn’t say anything about atom / bond properties. And there’s already support for flexible keys. Other programs, for example, don’t need to care about the projection matrix or the layers sections.

matterhorn103 · July 30, 2025, 5:35pm

The null in JSON is its own type, not a floating point. So in effect, any array containing some nulls is mixed type.

In fact, there’s no distinction between integers and floats in JSON, there’s just a single number type. There’s also no nan in JSON.

Well no, booleans are represented in JSON as true and false and so are easily distinguished by a JSON parser. I guess what you’re saying is that in C++ you can’t distinguish a bool from 0/1 integers? Which confuses me because I thought the point of C++ is that it’s typed – is a bool not a type that can be checked for by introspection? Like if you have bool x = true; can’t you then check the type of x?
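The JSON-side points can be checked quickly with Python's standard json module (this says nothing about how the C++ parser behaves, which is the open question here):

```python
import json

values = json.loads('[true, false, 1, 0, 1.0, null, "R"]')
print([type(v).__name__ for v in values])
# ['bool', 'bool', 'int', 'int', 'float', 'NoneType', 'str']

# JSON has no NaN literal. Python writes the non-standard token NaN by
# default, but refuses when asked to stay strictly within the standard.
try:
    json.dumps([float("nan")], allow_nan=False)
    nan_allowed = True
except ValueError:
    nan_allowed = False
print(nan_allowed)  # False
```

The int / float split in that output is a choice made by Python's parser based on how the number is written; JSON itself only defines a single number type. The true / false versus 1 / 0 split, on the other hand, is part of JSON.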

I feel like a lot here comes down to the way that the JSON gets parsed into C++ data structures. What do we use to do the parsing? I think it would be useful to look at the library to get an idea of how the conversion works.

Oh I agree with that, add that to my suggested conditions. With the possible exception that maybe null would be allowed too.

matterhorn103 · July 30, 2025, 5:40pm

Also maybe I’m showing a lack of imagination but given that the point of a boolean is that it can only have two possible values I can’t see why you would ever need a missing value. Unless the property wasn’t appropriate to store in a bool in the first place and should be a string or integer.

ghutchis · July 30, 2025, 6:29pm

But that’s my point. It’s possible to figure out how to handle a null / missing value with arrays of float or strings in C++. For a float, you just use a nan and for a string, you just make it empty. Done.

But there’s no good way of handling a missing value in an array of booleans or an array of integers.

Now, JSON doesn’t care (all numbers are one type), so it might be fine to just say everything becomes a float, with no integers. But I can definitely see a use case for an array of booleans potentially with missing values. And the only way to handle that is with a map.

In short, it’s a design decision:

1. we only allow string or float (or maybe vector / matrix) atom properties, which allows missing values, and integers get mapped to floating point
2. we allow booleans and integers, but we then need to use map / dictionary containers
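A sketch of what the first option implies when reading the JSON, written in Python for brevity (the function names and sample values are hypothetical): null becomes NaN in a numeric array and an empty string in a string array, and every number is widened to a float.

```python
import json
import math


def to_float_array(values):
    """Numbers: every entry becomes a float, and null becomes NaN."""
    return [math.nan if v is None else float(v) for v in values]


def to_string_array(values):
    """Strings: null becomes the empty string."""
    return ["" if v is None else v for v in values]


charges = to_float_array(json.loads("[0.12, null, 1, -0.3]"))
labels = to_string_array(json.loads('["R", null, "S"]'))

print(charges)  # [0.12, nan, 1.0, -0.3]
print(labels)   # ['R', '', 'S']
```

The same trick has no equivalent for booleans or integers, since neither type has a spare value that could mean "missing".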

Right. And my point is that in C++ having an array of mixed-type things can be hard, because arrays / vectors are supposed to be containers of objects of the same type. So null is not possible to encode in vector<int> or vector<bool> because null is not an int or a bool type.

matterhorn103 · July 30, 2025, 8:10pm

Ok, I see what you’re saying now.

What about an option 3: any type is fine but missing values are not?

ghutchis · July 30, 2025, 9:23pm

Turns out, in C++17 they added std::optional<T>, which fits things nicely. So you can have a vector<optional<bool>> in which each entry either has a value or not. (There’s also now std::any, but I think almost any property would be an array of identical types with potentially missing values.)

That seems like it’s exactly what we need. The JSON can include null to indicate missing values, and that maps to std::optional<type> as needed.
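As a rough Python analogue of std::vector<std::optional<bool>> (the actual C++ parsing code is not shown in this thread, and the "aromatic" property is a made-up example), a reader could accept true, false, or null and reject everything else:

```python
import json
from typing import List, Optional


def parse_optional_bools(text: str) -> List[Optional[bool]]:
    """Parse a JSON array whose entries are true, false, or null (missing)."""
    values = json.loads(text)
    for v in values:
        # Check for bool explicitly: 0 and 1 parse as int, not bool.
        if v is not None and not isinstance(v, bool):
            raise TypeError(f"expected true, false or null, got {v!r}")
    return values


aromatic = parse_optional_bools("[true, null, false, null]")
print(aromatic)  # [True, None, False, None]

# Indices of atoms with no value recorded.
missing = [i for i, v in enumerate(aromatic) if v is None]
print(missing)  # [1, 3]
```

This keeps the array layout (one slot per atom, direct indexing) while still letting individual entries be absent.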