`userOptions` and `highlightStyles`

So as it stands, the JSON files that a plugin should in future provide for userOptions and highlightStyles are identical to the output that the plugins should currently return upon being called with --print-options (and highlightStyles being a subset of that output).

While working on avo-plugin-demo and avogenerators I felt that it made little sense to have the top level key in each and that it’d be better to remove one layer of nesting in the JSON files.
And I figured the same could be done for the syntax highlighting rules, until I realised that highlightStyles is an array and therefore can’t have the same thing done.

As such, I just decided it wasn’t a big deal and left it for the moment.

When I asked @brockdyer03 to look over my PRs, though, this was one of the things he complained about. So since it turns out to not just be me who’s bothered by it, I’m taking that as a sign that it is a genuine concern. :smiley:

Could we please do the following:

  1. Drop the top-level object in the userOptions JSON that only contains the userOptions key. Note this would also apply when obtained dynamically. For example, instead of this for NWChem:
// options.json
{
  "userOptions": {
    "Title": {
      "type": "string",
      "default": ""
    },
    "Calculation Type": {
      "type": "stringList",
      "default": 1,
      "values": [
        "Single Point",
        "Equilibrium Geometry",
        "Frequencies"
      ]
    },
    ...
  }
}

it’d be just this:

// options.json
{
  "Title": {
    "type": "string",
    "default": ""
  },
  "Calculation Type": {
    "type": "stringList",
    "default": 1,
    "values": [
      "Single Point",
      "Equilibrium Geometry",
      "Frequencies"
    ]
  },
  ...
}

This also makes sense since the only other things that were till now part of the --print-options JSON were things that are now expressed in other places.

  1. Drop the top-level object of the highlightRules JSON, and rather than having an array of highlightStyles where each style has a style key to specify the style name, have an object where the keys are the style names and the values are the rest. For example, instead of this for NWChem:
// syntax.json
{
  "highlightStyles": [
    {
      "style": "default",
      "rules": [
        {
          "patterns": [
            {
              "regexp": "\\b[+-]?[.0-9]+(?:[eEdD][+-]?[.0-9]+)?\\b"
            }
          ],
          "format": {
            "preset": "literal"
          }
        },
        ...
      ]
    }
  ]
}

it’d be this:

// syntax.json
{
  "default": {
    "rules": [
      {
        "patterns": [
          {
            "regexp": "\\b[+-]?[.0-9]+(?:[eEdD][+-]?[.0-9]+)?\\b"
          }
        ],
        "format": {
          "preset": "literal"
        }
      },
      ...
    ]
  }
}

Much more obvious in both cases, if you ask me, and I think the less nesting the less confusing for anyone wanting to edit an existing generator.

The highlight rules seem fine. I’m less sure about the userOptions because sometimes it’ll have that and sometimes not.. like if an input format is requested? Which I thought we would just generally do anyway?

The only three things in the output of --print-options currently are:

{
  "userOptions": {
    ...
  },
  "highlightStyles" : [
  ...
  ],
  "inputMolecularFormat": "cjson"
}

highlightStyles is now contained in a separate JSON, and the input format is now specified as input-format in the plugin’s metadata TOML (pyproject.toml or avogadro.toml).

So in fact, thinking about it, the current way the new JSON files are structured, with the top-level keys, doesn’t make sense – each one is essentially the --print-options output but with the other two parts removed. Each JSON ought instead to contain just the value of the userOptions or highlightStyles key/value pair.

What do you mean?

1 Like

@brockdyer03 has additionally requested that it also be possible to use TOML files for this purpose if desired. How would you feel about allowing either?

Both are able to represent the data equivalently, and Avogadro will have parsers for both and could just detect the file extension.

Right now all the code for handling the highlight rules is set up for JSON. So it’s possible to support TOML, but it would take some work to refactor that.

Fair enough. For some reasons, I was thinking the userOptions had the inputMolecularFormat as one of the keys.

Maybe not too much, though, since as soon as it’s parsed into a C++ data structure it’s identical? :innocent:

It’s especially the highlight rules that would be extremely nice to be able to use TOML for, because they are full of regex and therefore the JSON is full of backslash escapes, but TOML provides literal strings. Thus the NWChem highlight rules:

{
  "default": {
    "rules": [
      {
        "patterns": [
          {
            "regexp": "\\b[+-]?[.0-9]+(?:[eEdD][+-]?[.0-9]+)?\\b"
          }
        ],
        "format": {
          "preset": "literal"
        }
      },
      {
        "patterns": [
          {
            "regexp": "\\b(?:re)?start\\b"
          },
          {
            "regexp": "\\b(?:scratch|permanent)?_dir\\b"
          },
          {
            "regexp": "\\bmemory\\b"
          },
          {
            "regexp": "\\becho\\b"
          },
          {
            "regexp": "\\btitle\\b"
          },
          {
            "regexp": "\\b(?:no)?print\\b"
          },
          {
            "regexp": "\\b(?:un)?set\\b"
          },
          {
            "regexp": "\\bstop\\b"
          },
          {
            "regexp": "\\btask\\b"
          },
          {
            "regexp": "\\becce_print\\b"
          },
          {
            "regexp": "\\bcharge\\b"
          },
          {
            "regexp": "\\bgeometry\\b"
          },
          {
            "regexp": "\\bbasis\\b"
          },
          {
            "regexp": "\\bspherical\\b"
          },
          {
            "regexp": "\\blibrary\\b"
          },
          {
            "regexp": "\\bend\\b"
          },
          {
            "regexp": "\\bxc\\b"
          },
          {
            "regexp": "\\bmult\\b"
          },
          {
            "regexp": "\\bfreeze atomic\\b"
          },
          {
            "regexp": "\\b(?:no)?center\\b"
          },
          {
            "regexp": "\\bbqbq\\b"
          },
          {
            "regexp": "\\bnuc(?:l(?:eus)?)?\\b"
          },
          {
            "regexp": "\\bscf\\b"
          },
          {
            "regexp": "\\bdft\\b"
          },
          {
            "regexp": "\\bmp2\\b"
          },
          {
            "regexp": "\\bccsd\\b"
          },
          {
            "regexp": "\\bunits\\b"
          },
          {
            "regexp": "\\bautosym\\b"
          }
        ],
        "format": {
          "preset": "keyword"
        }
      },
      {
        "patterns": [
          {
            "regexp": "\\*\\s+library\\s+([^\\n]+)"
          },
          {
            "regexp": "\\*\\s+library\\s+([^\\n]+)"
          },
          {
            "regexp": "\\*\\s+library\\s+([^\\n]+)"
          },
          {
            "regexp": "\\*\\s+library\\s+([^\\n]+)"
          },
          {
            "regexp": "\\*\\s+library\\s+([^\\n]+)"
          },
          {
            "regexp": "\\*\\s+library\\s+([^\\n]+)"
          },
          {
            "regexp": "\\*\\s+library\\s+([^\\n]+)"
          },
          {
            "regexp": "\\*\\s+library\\s+([^\\n]+)"
          },
          {
            "regexp": "\\*\\s+library\\s+([^\\n]+)"
          },
          {
            "regexp": "units\\s+(an|angstroms|au|atom|bohr|nm|nanometers|pm|picometers)"
          },
          {
            "regexp": "(?:re)?start\\s+([^;\\n]+)"
          },
          {
            "regexp": "\\bprint\\s+(xyz)"
          },
          {
            "regexp": "\\autosym\\s+([-\\d.eEdD+]+)"
          },
          {
            "regexp": "\\bnuc(?:l(?:eus)?)?\\s+([^\\s;]+)"
          },
          {
            "regexp": "\\bbasis\\s+(spherical)\\b"
          },
          {
            "regexp": "\\bxc\\s+([^\\n]+)\\b"
          },
          {
            "regexp": "\\btask\\s+((?:mc)?scf|(:?so)?dft|(:?direct_|ri)?mp2|ccsd(:?\\(t\\))?|selci|md|pspw|band|tce)\\s+(energy|gradient|optimize|saddle|hessian|freq(?:uencies)?|property|(?:thermo)?dynamics)\\b"
          }
        ],
        "format": {
          "preset": "property"
        }
      },
      {
        "patterns": [
          {
            "regexp": "title\\s+\"(.+)\""
          }
        ],
        "format": {
          "preset": "title"
        }
      },
      {
        "patterns": [
          {
            "regexp": "#[^\n]*"
          }
        ],
        "format": {
          "preset": "comment"
        }
      }
    ]
  }
}

which, in the interest of a fair comparison could be rewritten (though no one has till now) as:

{
  "default": {
    "rules": [
      {
        "patterns": [
          { "regexp": "\\b[+-]?[.0-9]+(?:[eEdD][+-]?[.0-9]+)?\\b" }
        ],
        "format": { "preset": "literal" }
      },
      {
        "patterns": [
          { "regexp": "\\b(?:re)?start\\b" },
          { "regexp": "\\b(?:scratch|permanent)?_dir\\b" },
          { "regexp": "\\bmemory\\b" },
          { "regexp": "\\becho\\b" },
          { "regexp": "\\btitle\\b" },
          { "regexp": "\\b(?:no)?print\\b" },
          { "regexp": "\\b(?:un)?set\\b" },
          { "regexp": "\\bstop\\b" },
          { "regexp": "\\btask\\b" },
          { "regexp": "\\becce_print\\b" },
          { "regexp": "\\bcharge\\b" },
          { "regexp": "\\bgeometry\\b" },
          { "regexp": "\\bbasis\\b" },
          { "regexp": "\\bspherical\\b" },
          { "regexp": "\\blibrary\\b" },
          { "regexp": "\\bend\\b" },
          { "regexp": "\\bxc\\b" },
          { "regexp": "\\bmult\\b" },
          { "regexp": "\\bfreeze atomic\\b" },
          { "regexp": "\\b(?:no)?center\\b" },
          { "regexp": "\\bbqbq\\b" },
          { "regexp": "\\bnuc(?:l(?:eus)?)?\\b" },
          { "regexp": "\\bscf\\b" },
          { "regexp": "\\bdft\\b" },
          { "regexp": "\\bmp2\\b" },
          { "regexp": "\\bccsd\\b" },
          { "regexp": "\\bunits\\b" },
          { "regexp": "\\bautosym\\b"
          }
        ],
        "format": { "preset": "keyword" }
      },
      {
        "patterns": [
          { "regexp": "\\*\\s+library\\s+([^\\n]+)" },
          { "regexp": "\\*\\s+library\\s+([^\\n]+)" },
          { "regexp": "\\*\\s+library\\s+([^\\n]+)" },
          { "regexp": "\\*\\s+library\\s+([^\\n]+)" },
          { "regexp": "\\*\\s+library\\s+([^\\n]+)" },
          { "regexp": "\\*\\s+library\\s+([^\\n]+)" },
          { "regexp": "\\*\\s+library\\s+([^\\n]+)" },
          { "regexp": "\\*\\s+library\\s+([^\\n]+)" },
          { "regexp": "\\*\\s+library\\s+([^\\n]+)" },
          { "regexp": "units\\s+(an|angstroms|au|atom|bohr|nm|nanometers|pm|picometers)" },
          { "regexp": "(?:re)?start\\s+([^;\\n]+)" },
          { "regexp": "\\bprint\\s+(xyz)" },
          { "regexp": "\\autosym\\s+([-\\d.eEdD+]+)" },
          { "regexp": "\\bnuc(?:l(?:eus)?)?\\s+([^\\s;]+)" },
          { "regexp": "\\bbasis\\s+(spherical)\\b" },
          { "regexp": "\\bxc\\s+([^\\n]+)\\b" },
          { "regexp": "\\btask\\s+((?:mc)?scf|(:?so)?dft|(:?direct_|ri)?mp2|ccsd(:?\\(t\\))?|selci|md|pspw|band|tce)\\s+(energy|gradient|optimize|saddle|hessian|freq(?:uencies)?|property|(?:thermo)?dynamics)\\b" }
        ],
        "format": { "preset": "property" }
      },
      {
        "patterns": [
          { "regexp": "title\\s+\"(.+)\"" }
        ],
        "format": { "preset": "title" }
      },
      {
        "patterns": [
          { "regexp": "#[^\n]*" }
        ],
        "format": { "preset": "comment" }
      }
    ]
  }
}

becomes as TOML:

[[default.rules]]
format.preset = "literal"
patterns = [
    { regexp = '\b[+-]?[.0-9]+(?:[eEdD][+-]?[.0-9]+)?\b' },
]

[[default.rules]]
format.preset = "keyword"
patterns = [
    { regexp = '\b(?:re)?start\b' },
    { regexp = '\b(?:scratch|permanent)?_dir\b' },
    { regexp = '\bmemory\b' },
    { regexp = '\becho\b' },
    { regexp = '\btitle\b' },
    { regexp = '\b(?:no)?print\b' },
    { regexp = '\b(?:un)?set\b' },
    { regexp = '\bstop\b' },
    { regexp = '\btask\b' },
    { regexp = '\becce_print\b' },
    { regexp = '\bcharge\b' },
    { regexp = '\bgeometry\b' },
    { regexp = '\bbasis\b' },
    { regexp = '\bspherical\b' },
    { regexp = '\blibrary\b' },
    { regexp = '\bend\b' },
    { regexp = '\bxc\b' },
    { regexp = '\bmult\b' },
    { regexp = '\bfreeze atomic\b' },
    { regexp = '\b(?:no)?center\b' },
    { regexp = '\bbqbq\b' },
    { regexp = '\bnuc(?:l(?:eus)?)?\b' },
    { regexp = '\bscf\b' },
    { regexp = '\bdft\b' },
    { regexp = '\bmp2\b' },
    { regexp = '\bccsd\b' },
    { regexp = '\bunits\b' },
    { regexp = '\bautosym\b' },
]

[[default.rules]]
format.preset = "property"
patterns = [
    { regexp = '\*\s+library\s+([^\n]+)' },
    { regexp = '\*\s+library\s+([^\n]+)' },
    { regexp = '\*\s+library\s+([^\n]+)' },
    { regexp = '\*\s+library\s+([^\n]+)' },
    { regexp = '\*\s+library\s+([^\n]+)' },
    { regexp = '\*\s+library\s+([^\n]+)' },
    { regexp = '\*\s+library\s+([^\n]+)' },
    { regexp = '\*\s+library\s+([^\n]+)' },
    { regexp = '\*\s+library\s+([^\n]+)' },
    { regexp = 'units\s+(an|angstroms|au|atom|bohr|nm|nanometers|pm|picometers)' },
    { regexp = '(?:re)?start\s+([^;\\n]+)' },
    { regexp = '\bprint\s+(xyz)' },
    { regexp = '\autosym\s+([-\d.eEdD+]+)' },
    { regexp = '\bnuc(?:l(?:eus)?)?\s+([^\s;]+)' },
    { regexp = '\bbasis\s+(spherical)\b' },
    { regexp = '\bxc\s+([^\n]+)\b' },
    { regexp = '\btask\s+((?:mc)?scf|(:?so)?dft|(:?direct_|ri)?mp2|ccsd(:?\(t\))?|selci|md|pspw|band|tce)\s+(energy|gradient|optimize|saddle|hessian|freq(?:uencies)?|property|(?:thermo)?dynamics)\b' },
]

[[default.rules]]
format.preset = "title"
patterns = [
    { regexp = 'title\s+"(.+)"' },
]

[[default.rules]]
format.preset = "comment"
patterns = [
    { regexp = '#[^\n]*' },
]

Speaking from personal experience, such endless escaping (cf: trying to use { characters in Python f-strings) is a huge source of errors. In fact, I believe the final regex in the JSON ("#[^\n]*") is indeed such an error!

Also, interestingly, after having converted a few of the JSONs of avogenerators to TOML I’ve found that TOML reduces the file sizes by 20–30%. (Removing all the indentation from the JSON also reduces the size of course, by even a little bit more, but it becomes unreadable.)

1 Like

As someone who is about to embark on writing the syntax highlighting for the ORCA input generator, I think switching to TOML would be indispensable for regex. Not needing to escape the escapes is a big improvement over JSON, and will make things significantly easier.

1 Like

It’s a good point. I’ll see what would be the easiest refactor for this.

1 Like