`userOptions` and `highlightStyles`

So as it stands, the JSON files that a plugin should in future provide for userOptions and highlightStyles are identical to the output that the plugins should currently return upon being called with --print-options (and highlightStyles being a subset of that output).

While working on avo-plugin-demo and avogenerators I felt that it made little sense to have the top level key in each and that it’d be better to remove one layer of nesting in the JSON files.
And I figured the same could be done for the syntax highlighting rules, until I realised that highlightStyles is an array and therefore can’t have the same thing done.

As such, I just decided it wasn’t a big deal and left it for the moment.

When I asked @brockdyer03 to look over my PRs, though, this was one of the things he complained about. So since it turns out to not just be me who’s bothered by it, I’m taking that as a sign that it is a genuine concern. :smiley:

Could we please do the following:

  1. Drop the top-level object in the userOptions JSON that only contains the userOptions key. Note this would also apply when obtained dynamically. For example, instead of this for NWChem:
// options.json
{
  "userOptions": {
    "Title": {
      "type": "string",
      "default": ""
    },
    "Calculation Type": {
      "type": "stringList",
      "default": 1,
      "values": [
        "Single Point",
        "Equilibrium Geometry",
        "Frequencies"
      ]
    },
    ...
  }
}

it’d be just this:

// options.json
{
  "Title": {
    "type": "string",
    "default": ""
  },
  "Calculation Type": {
    "type": "stringList",
    "default": 1,
    "values": [
      "Single Point",
      "Equilibrium Geometry",
      "Frequencies"
    ]
  },
  ...
}

This also makes sense since the only other things that were till now part of the --print-options JSON were things that are now expressed in other places.

  1. Drop the top-level object of the highlightRules JSON, and rather than having an array of highlightStyles where each style has a style key to specify the style name, have an object where the keys are the style names and the values are the rest. For example, instead of this for NWChem:
// syntax.json
{
  "highlightStyles": [
    {
      "style": "default",
      "rules": [
        {
          "patterns": [
            {
              "regexp": "\\b[+-]?[.0-9]+(?:[eEdD][+-]?[.0-9]+)?\\b"
            }
          ],
          "format": {
            "preset": "literal"
          }
        },
        ...
      ]
    }
  ]
}

it’d be this:

// syntax.json
{
  "default": {
    "rules": [
      {
        "patterns": [
          {
            "regexp": "\\b[+-]?[.0-9]+(?:[eEdD][+-]?[.0-9]+)?\\b"
          }
        ],
        "format": {
          "preset": "literal"
        }
      },
      ...
    ]
  }
}

Much more obvious in both cases, if you ask me, and I think the less nesting the less confusing for anyone wanting to edit an existing generator.

The highlight rules seem fine. I’m less sure about the userOptions because sometimes it’ll have that and sometimes not.. like if an input format is requested? Which I thought we would just generally do anyway?

The only three things in the output of --print-options currently are:

{
  "userOptions": {
    ...
  },
  "highlightStyles" : [
  ...
  ],
  "inputMolecularFormat": "cjson"
}

highlightStyles is now contained in a separate JSON, and the input format is now specified as input-format in the plugin’s metadata TOML (pyproject.toml or avogadro.toml).

So in fact, thinking about it, the current way the new JSON files are structured, with the top-level keys, doesn’t make sense – each one is essentially the --print-options output but with the other two parts removed. Each JSON ought instead to contain just the value of the userOptions or highlightStyles key/value pair.

What do you mean?

1 Like

@brockdyer03 has additionally requested that it also be possible to use TOML files for this purpose if desired. How would you feel about allowing either?

Both are able to represent the data equivalently, and Avogadro will have parsers for both and could just detect the file extension.

Right now all the code for handling the highlight rules is set up for JSON. So it’s possible to support TOML, but it would take some work to refactor that.

Fair enough. For some reasons, I was thinking the userOptions had the inputMolecularFormat as one of the keys.

Maybe not too much, though, since as soon as it’s parsed into a C++ data structure it’s identical? :innocent:

It’s especially the highlight rules that would be extremely nice to be able to use TOML for, because they are full of regex and therefore the JSON is full of backslash escapes, but TOML provides literal strings. Thus the NWChem highlight rules:

{
  "default": {
    "rules": [
      {
        "patterns": [
          {
            "regexp": "\\b[+-]?[.0-9]+(?:[eEdD][+-]?[.0-9]+)?\\b"
          }
        ],
        "format": {
          "preset": "literal"
        }
      },
      {
        "patterns": [
          {
            "regexp": "\\b(?:re)?start\\b"
          },
          {
            "regexp": "\\b(?:scratch|permanent)?_dir\\b"
          },
          {
            "regexp": "\\bmemory\\b"
          },
          {
            "regexp": "\\becho\\b"
          },
          {
            "regexp": "\\btitle\\b"
          },
          {
            "regexp": "\\b(?:no)?print\\b"
          },
          {
            "regexp": "\\b(?:un)?set\\b"
          },
          {
            "regexp": "\\bstop\\b"
          },
          {
            "regexp": "\\btask\\b"
          },
          {
            "regexp": "\\becce_print\\b"
          },
          {
            "regexp": "\\bcharge\\b"
          },
          {
            "regexp": "\\bgeometry\\b"
          },
          {
            "regexp": "\\bbasis\\b"
          },
          {
            "regexp": "\\bspherical\\b"
          },
          {
            "regexp": "\\blibrary\\b"
          },
          {
            "regexp": "\\bend\\b"
          },
          {
            "regexp": "\\bxc\\b"
          },
          {
            "regexp": "\\bmult\\b"
          },
          {
            "regexp": "\\bfreeze atomic\\b"
          },
          {
            "regexp": "\\b(?:no)?center\\b"
          },
          {
            "regexp": "\\bbqbq\\b"
          },
          {
            "regexp": "\\bnuc(?:l(?:eus)?)?\\b"
          },
          {
            "regexp": "\\bscf\\b"
          },
          {
            "regexp": "\\bdft\\b"
          },
          {
            "regexp": "\\bmp2\\b"
          },
          {
            "regexp": "\\bccsd\\b"
          },
          {
            "regexp": "\\bunits\\b"
          },
          {
            "regexp": "\\bautosym\\b"
          }
        ],
        "format": {
          "preset": "keyword"
        }
      },
      {
        "patterns": [
          {
            "regexp": "\\*\\s+library\\s+([^\\n]+)"
          },
          {
            "regexp": "\\*\\s+library\\s+([^\\n]+)"
          },
          {
            "regexp": "\\*\\s+library\\s+([^\\n]+)"
          },
          {
            "regexp": "\\*\\s+library\\s+([^\\n]+)"
          },
          {
            "regexp": "\\*\\s+library\\s+([^\\n]+)"
          },
          {
            "regexp": "\\*\\s+library\\s+([^\\n]+)"
          },
          {
            "regexp": "\\*\\s+library\\s+([^\\n]+)"
          },
          {
            "regexp": "\\*\\s+library\\s+([^\\n]+)"
          },
          {
            "regexp": "\\*\\s+library\\s+([^\\n]+)"
          },
          {
            "regexp": "units\\s+(an|angstroms|au|atom|bohr|nm|nanometers|pm|picometers)"
          },
          {
            "regexp": "(?:re)?start\\s+([^;\\n]+)"
          },
          {
            "regexp": "\\bprint\\s+(xyz)"
          },
          {
            "regexp": "\\autosym\\s+([-\\d.eEdD+]+)"
          },
          {
            "regexp": "\\bnuc(?:l(?:eus)?)?\\s+([^\\s;]+)"
          },
          {
            "regexp": "\\bbasis\\s+(spherical)\\b"
          },
          {
            "regexp": "\\bxc\\s+([^\\n]+)\\b"
          },
          {
            "regexp": "\\btask\\s+((?:mc)?scf|(:?so)?dft|(:?direct_|ri)?mp2|ccsd(:?\\(t\\))?|selci|md|pspw|band|tce)\\s+(energy|gradient|optimize|saddle|hessian|freq(?:uencies)?|property|(?:thermo)?dynamics)\\b"
          }
        ],
        "format": {
          "preset": "property"
        }
      },
      {
        "patterns": [
          {
            "regexp": "title\\s+\"(.+)\""
          }
        ],
        "format": {
          "preset": "title"
        }
      },
      {
        "patterns": [
          {
            "regexp": "#[^\n]*"
          }
        ],
        "format": {
          "preset": "comment"
        }
      }
    ]
  }
}

which, in the interest of a fair comparison could be rewritten (though no one has till now) as:

{
  "default": {
    "rules": [
      {
        "patterns": [
          { "regexp": "\\b[+-]?[.0-9]+(?:[eEdD][+-]?[.0-9]+)?\\b" }
        ],
        "format": { "preset": "literal" }
      },
      {
        "patterns": [
          { "regexp": "\\b(?:re)?start\\b" },
          { "regexp": "\\b(?:scratch|permanent)?_dir\\b" },
          { "regexp": "\\bmemory\\b" },
          { "regexp": "\\becho\\b" },
          { "regexp": "\\btitle\\b" },
          { "regexp": "\\b(?:no)?print\\b" },
          { "regexp": "\\b(?:un)?set\\b" },
          { "regexp": "\\bstop\\b" },
          { "regexp": "\\btask\\b" },
          { "regexp": "\\becce_print\\b" },
          { "regexp": "\\bcharge\\b" },
          { "regexp": "\\bgeometry\\b" },
          { "regexp": "\\bbasis\\b" },
          { "regexp": "\\bspherical\\b" },
          { "regexp": "\\blibrary\\b" },
          { "regexp": "\\bend\\b" },
          { "regexp": "\\bxc\\b" },
          { "regexp": "\\bmult\\b" },
          { "regexp": "\\bfreeze atomic\\b" },
          { "regexp": "\\b(?:no)?center\\b" },
          { "regexp": "\\bbqbq\\b" },
          { "regexp": "\\bnuc(?:l(?:eus)?)?\\b" },
          { "regexp": "\\bscf\\b" },
          { "regexp": "\\bdft\\b" },
          { "regexp": "\\bmp2\\b" },
          { "regexp": "\\bccsd\\b" },
          { "regexp": "\\bunits\\b" },
          { "regexp": "\\bautosym\\b"
          }
        ],
        "format": { "preset": "keyword" }
      },
      {
        "patterns": [
          { "regexp": "\\*\\s+library\\s+([^\\n]+)" },
          { "regexp": "\\*\\s+library\\s+([^\\n]+)" },
          { "regexp": "\\*\\s+library\\s+([^\\n]+)" },
          { "regexp": "\\*\\s+library\\s+([^\\n]+)" },
          { "regexp": "\\*\\s+library\\s+([^\\n]+)" },
          { "regexp": "\\*\\s+library\\s+([^\\n]+)" },
          { "regexp": "\\*\\s+library\\s+([^\\n]+)" },
          { "regexp": "\\*\\s+library\\s+([^\\n]+)" },
          { "regexp": "\\*\\s+library\\s+([^\\n]+)" },
          { "regexp": "units\\s+(an|angstroms|au|atom|bohr|nm|nanometers|pm|picometers)" },
          { "regexp": "(?:re)?start\\s+([^;\\n]+)" },
          { "regexp": "\\bprint\\s+(xyz)" },
          { "regexp": "\\autosym\\s+([-\\d.eEdD+]+)" },
          { "regexp": "\\bnuc(?:l(?:eus)?)?\\s+([^\\s;]+)" },
          { "regexp": "\\bbasis\\s+(spherical)\\b" },
          { "regexp": "\\bxc\\s+([^\\n]+)\\b" },
          { "regexp": "\\btask\\s+((?:mc)?scf|(:?so)?dft|(:?direct_|ri)?mp2|ccsd(:?\\(t\\))?|selci|md|pspw|band|tce)\\s+(energy|gradient|optimize|saddle|hessian|freq(?:uencies)?|property|(?:thermo)?dynamics)\\b" }
        ],
        "format": { "preset": "property" }
      },
      {
        "patterns": [
          { "regexp": "title\\s+\"(.+)\"" }
        ],
        "format": { "preset": "title" }
      },
      {
        "patterns": [
          { "regexp": "#[^\n]*" }
        ],
        "format": { "preset": "comment" }
      }
    ]
  }
}

becomes as TOML:

[[default.rules]]
format.preset = "literal"
patterns = [
    { regexp = '\b[+-]?[.0-9]+(?:[eEdD][+-]?[.0-9]+)?\b' },
]

[[default.rules]]
format.preset = "keyword"
patterns = [
    { regexp = '\b(?:re)?start\b' },
    { regexp = '\b(?:scratch|permanent)?_dir\b' },
    { regexp = '\bmemory\b' },
    { regexp = '\becho\b' },
    { regexp = '\btitle\b' },
    { regexp = '\b(?:no)?print\b' },
    { regexp = '\b(?:un)?set\b' },
    { regexp = '\bstop\b' },
    { regexp = '\btask\b' },
    { regexp = '\becce_print\b' },
    { regexp = '\bcharge\b' },
    { regexp = '\bgeometry\b' },
    { regexp = '\bbasis\b' },
    { regexp = '\bspherical\b' },
    { regexp = '\blibrary\b' },
    { regexp = '\bend\b' },
    { regexp = '\bxc\b' },
    { regexp = '\bmult\b' },
    { regexp = '\bfreeze atomic\b' },
    { regexp = '\b(?:no)?center\b' },
    { regexp = '\bbqbq\b' },
    { regexp = '\bnuc(?:l(?:eus)?)?\b' },
    { regexp = '\bscf\b' },
    { regexp = '\bdft\b' },
    { regexp = '\bmp2\b' },
    { regexp = '\bccsd\b' },
    { regexp = '\bunits\b' },
    { regexp = '\bautosym\b' },
]

[[default.rules]]
format.preset = "property"
patterns = [
    { regexp = '\*\s+library\s+([^\n]+)' },
    { regexp = '\*\s+library\s+([^\n]+)' },
    { regexp = '\*\s+library\s+([^\n]+)' },
    { regexp = '\*\s+library\s+([^\n]+)' },
    { regexp = '\*\s+library\s+([^\n]+)' },
    { regexp = '\*\s+library\s+([^\n]+)' },
    { regexp = '\*\s+library\s+([^\n]+)' },
    { regexp = '\*\s+library\s+([^\n]+)' },
    { regexp = '\*\s+library\s+([^\n]+)' },
    { regexp = 'units\s+(an|angstroms|au|atom|bohr|nm|nanometers|pm|picometers)' },
    { regexp = '(?:re)?start\s+([^;\\n]+)' },
    { regexp = '\bprint\s+(xyz)' },
    { regexp = '\autosym\s+([-\d.eEdD+]+)' },
    { regexp = '\bnuc(?:l(?:eus)?)?\s+([^\s;]+)' },
    { regexp = '\bbasis\s+(spherical)\b' },
    { regexp = '\bxc\s+([^\n]+)\b' },
    { regexp = '\btask\s+((?:mc)?scf|(:?so)?dft|(:?direct_|ri)?mp2|ccsd(:?\(t\))?|selci|md|pspw|band|tce)\s+(energy|gradient|optimize|saddle|hessian|freq(?:uencies)?|property|(?:thermo)?dynamics)\b' },
]

[[default.rules]]
format.preset = "title"
patterns = [
    { regexp = 'title\s+"(.+)"' },
]

[[default.rules]]
format.preset = "comment"
patterns = [
    { regexp = '#[^\n]*' },
]

Speaking from personal experience, such endless escaping (cf: trying to use { characters in Python f-strings) is a huge source of errors. In fact, I believe the final regex in the JSON ("#[^\n]*") is indeed such an error!

Also, interestingly, after having converted a few of the JSONs of avogenerators to TOML I’ve found that TOML reduces the file sizes by 20–30%. (Removing all the indentation from the JSON also reduces the size of course, by even a little bit more, but it becomes unreadable.)

1 Like

As someone who is about to embark on writing the syntax highlighting for the ORCA input generator, I think switching to TOML would be indispensable for regex. Not needing to escape the escapes is a big improvement over JSON, and will make things significantly easier.

1 Like

It’s a good point. I’ll see what would be the easiest refactor for this.

1 Like

Picking up this thread again.

Almost all of what has been discussed has been implemented now, and both the user options and syntax highlighting rules may be in either JSON format or TOML.

The final piece of the puzzle is the way in which a tabbed interface should be described in user options.

Current schema

No tabs

For reference, a basic untabbed interface is declared like in this basic example (the JSON and TOML are equivalent):

{
  "widget1": {
    "type": "stringList",
    "default": 1,
    "values": [
      "value0",
      "value1",
      "value2"
    ]
  },
  "widget2": {
    "type": "integer",
    "default": 1,
    "minimum": 0,
    "maximum": 255
  }
}
[widget1]
type = "stringList"
default = 1
values = [
    "value0",
    "value1",
    "value2",
]

[widget2]
type = "integer"
default = 1
minimum = 0
maximum = 255

Tabs

[
  {
    "tabName": "tabA",
    "widgetA1": {
      "type": "stringList",
      "default": 1,
      "values": [
        "value0",
        "value1",
        "value2",
      ]
    },
    "widgetA2": {
      "type": "integer",
      "default": 1,
      "minimum": 0,
      "maximum": 255
    }
  },
  {
    "tabName": "tabB",
    "widgetB1": {
      "type": "string",
      "default": ""
    }
  }
]

Unfortunately this can’t be represented as TOML, because the top level of a TOML document has to be a table, not an array.

Proposed new options for specifying tabs

Here’s the best we’ve come up with so far.

In all cases the way to specify an untabbed interface would be the same as above, unchanged.

Idea 1 – Nest the array under a tabs key

{
  "tabs": [
    {
      "tabName": "tabA",
      "widgetA1": {
        "type": "stringList",
        "default": 1,
        "values": [
          "value0",
          "value1",
          "value2",
        ]
      },
      "widgetA2": {
        "type": "integer",
        "default": 1,
        "minimum": 0,
        "maximum": 255
      }
    },
    {
      "tabName": "tabB",
      "widgetB1": {
        "type": "string",
        "default": ""
      }
    }
  ]
}
[[tabs]]
tabName = "tabA"

[tabs.widgetA1]
type = "stringList"
default = 1
values = [
    "value0",
    "value1",
    "value2",
]

[tabs.widgetA2]
type = "integer"
default = 1
minimum = 0
maximum = 255

[[tabs]]
tabName = "tabB"

[tabs.widgetB1]
type = "string"
default = ""
  • Would be the simplest to actually implement – it’s identical to the current schema, just with the extra level of testing
  • Works fine in JSON, but not great in TOML (the tabs everywhere feels really unnecessary, and no one really like the array of tables syntax in TOML)
  • Hard to look at a widget and see which tab it’s in, in both JSON and TOML

Idea 2 – Tabs as top-level objects/tables with widgets within them

{
  "tabbed": true,
  "tabA": {
    "widgetA1": {
      "type": "stringList",
      "default": 1,
      "values": [
        "value0",
        "value1",
        "value2"
      ]
    },
    "widgetA2": {
      "type": "integer",
      "default": 1,
      "minimum": 0,
      "maximum": 255
    }
  },
  "tabB": {
    "widgetB1": {
      "type": "string",
      "default": ""
    }
  }
}
tabbed = true

[tabA.widgetA1]
type = "stringList"
default = 1
values = [
    "value0",
    "value1",
    "value2",
]

[tabA.widgetA2]
type = "integer"
default = 1
minimum = 0
maximum = 255

[tabB.widgetB1]
type = "string"
default = ""
  • Having the widgets as children of the tabs they belong to feels to us like the most natural structure?
  • Plays to TOML’s strengths
  • Very obvious in the TOML which tab each widget belongs to, less so in JSON but still the most obvious out of all three ideas
  • Tabs aren’t guaranteed to be in the same order as in the file; would probably solve this by allowing optionally specifying the tab order, something like:
[tabA]
tabPosition = 0

[tabA.widgetA1]
type = "stringList"
...

[tabB]
tabPosition = 1

[tabB.widgetB1]
type = "string"
...
  • Along the same lines it’s in general easy to specify attributes for a tab as a whole, should that ever become a thing

Idea 3 – Widgets at top level, just like when not tabbed, tab as attribute

{
  "tabs": [
    "tabA",
    "tabB"
  ],
  "widgetA1": {
    "type": "stringList",
    "default": 1,
    "values": [
      "value0",
      "value1",
      "value2"
    ],
    "tab": 0
  },
  "widgetA2": {
    "type": "integer",
    "default": 1,
    "minimum": 0,
    "maximum": 255,
    "tab": 0
  },
  "widgetB1": {
    "type": "string",
    "default": "",
    "tab": 1
  }
}
tabs = ["tabA", "tabB"]

[widgetA1]
type = "stringList"
default = 1
values = [
    "value0",
    "value1",
    "value2",
]
tab = 0

[widgetA2]
type = "integer"
default = 1
minimum = 0
maximum = 255
tab = 0

[widgetB1]
type = "string"
default = ""
tab = 1
  • Consistent with the normal, untabbed version
  • Requires all widgets to have unique identifiers across all tabs (may be good or bad?)
  • No grouping by tab :cry:
  • Which tab a widget belongs to requires cross-referencing the array of tab names

Thoughts? Preferences? Alternative suggestions?

This is important. Right now, it doesn’t matter what tab a widget is on, the options are collected across the entire interface. Qt is also unhappy when you have widgets with duplicate names. (Yes, you could give it a tab.widget identifier.)

I’m relatively ambivalent about the syntax for tabs, but IMHO that’s a requirement that all widgets need to have unique identifiers. I tend to also prefer JSON, but if there’s interest in using TOML, that sounds fine.

My other suggestion, which I mentioned to @brockdyer03 is that the “basics” should have fairly good defaults. (And while I don’t encourage B3LYP, I don’t think we can totally remove it from the interface.)

I had a suspicion it might be.

Well, naturally it’s also possible just to tell authors that the names need to be unique, so it’s not like the other schemas are ruled out by that. The current schema is just like that – authors simply have to know that they shouldn’t reuse names across tabs.

But having all options/widgets be members of the same object/table does neatly enforce the uniqueness.

@brockdyer03 and I also discussed the fact that it’s much more intuitive if the structure the plugin uses for userOptions mirrors that of the options that Avogadro passes to the plugin (containing the user’s choices).

Since options is a flat object containing all the options as key/value pairs, not organised by tab, it matches idea (3). If we went for (1) or (2) then we both agreed that it’d make sense to also organise options by tab.

Idea (3) might be a little more palatable as well if the tab of each widget was specified by a string rather than an index e.g.

  "widgetB1": {
    "tab": "tabB",
    "type": "string",
    "default": ""
  }
[widgetB1]
tab = "tabB"
type = "string"
default = ""

There’d presumably be an efficiency cost, but it’d be much more readable and editable without having to constantly cross-reference (“Oh, which tab was tab 2 again?”)

I’m strongly in favor of “Idea 2”. It, as @matterhorn103 suggested, strongly plays to TOML’s strengths, it makes it easier to organize, and it would probably be the most efficient in terms of lookup since you don’t have to do any cross-referencing.

In the case of the ORCA generator rework I’ve been working on, I am aiming to have the options.toml file and the syntax.toml file be 100% auto-generated using the items specified in the code. In the interest of making that happen, it would be extremely nice to be able to create an object that can use a top-down hierarchy (like that outlined in idea 2) and not need to flatten everything out and have an extra attribute in each item.

I would still be okay requiring unique names though, since I while I am interested in writing the C++ end of this myself, I am not entirely thrilled about having to do a lot of Qt shenanigans.

Yeah, I think that’s a given that that will still be a requirement.

I managed to get this working (according to schema 3, but with tabs indicated by strings) in this PR:

1 Like