Policy on AI / LLM Contributions

In part because I’m dealing with this in classes, I want to bring up some guidelines for AI / LLM policy for contributions to Avogadro.

I think at the moment, my policy is against AI-generated code contributions. Despite claims by Claude, ChatGPT, etc., when I’ve evaluated long-form C++ suggestions, they are often flawed. (It’s a bit different when it’s a line-by-line suggestion tool, although they still aren’t great to be honest.)

Where I have found it somewhat useful on the code side is learning - for example, I’m not that familiar with modern OpenGL programming, so seeing some examples of GLSL shaders was helpful.

I’m somewhat more open to using it to help write / edit the documentation, but would appreciate different opinions. I think we already have a start from the Avogadro v1 docs, and the most important thing is updating screenshots / step-by-step tutorials. But I can see that tweaking grammar, etc. (as opposed to wholesale writing of paragraphs / pages) might be useful.

Perhaps that ended up in a similar place:

  • :green_circle: Line-by-line edits / suggestions with human oversight
  • :red_circle: Large-scale modifications / writing chunks of code or documentation

Other thoughts? Is there a yellow level?

Since I’ve seen this come up in many different spaces now over the past year or so, and I always have the same question, I’ll finally ask it: do you have any ideas or plans for how to enforce a policy?

My working hypothesis is that it’s easy to prevent obvious drive-by changes (commits with Claude/etc. as co-author, giant PRs with code that isn’t consistent with the author or is internally inconsistent regarding complexity or style) but small changes or contributions that work hard to cover up their provenance are impossible to identify the source of.

As you might imagine, this subject has come up on the academic side. For example, for writing there’s now Pangram, which claims to detect AI and has actually been adopted by the University of Maryland. (Several of us are trying it out at Pitt as part of an AI-in-the-classroom workshop.)

I’m not aware of any AI detector for code, although I suspect it would be possible. (e.g., train on a bunch of code prior to AI tools, train on a bunch of output from ChatGPT, Claude, Gemini, etc.)
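
A minimal sketch of that train-on-two-corpora idea, assuming a toy bag-of-tokens naive Bayes model with add-one smoothing. Everything here is hypothetical and for illustration only: the class name, the `pre_ai` / `llm` labels, and the tokenizer are made up, and a usable detector would need large labelled corpora and almost certainly a much stronger model.

```python
import math
import re
from collections import Counter

# Hypothetical tokenizer: runs of letters/underscores, or any single
# non-whitespace character (digits, operators, punctuation).
TOKEN = re.compile(r"[A-Za-z_]+|\S")


def tokenize(code):
    """Split a code snippet into crude tokens."""
    return TOKEN.findall(code)


class NaiveBayesCodeClassifier:
    """Toy multinomial naive Bayes over code tokens (illustration only)."""

    def __init__(self):
        self.token_counts = {}       # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training snippets
        self.vocab = set()           # every token seen during training

    def train(self, samples):
        """samples: iterable of (code, label) pairs."""
        for code, label in samples:
            tokens = tokenize(code)
            self.token_counts.setdefault(label, Counter()).update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)

    def log_scores(self, code):
        """Return an unnormalised log-probability for each label."""
        total_docs = sum(self.doc_counts.values())
        tokens = tokenize(code)
        scores = {}
        for label, counts in self.token_counts.items():
            # Class prior from the share of training snippets.
            score = math.log(self.doc_counts[label] / total_docs)
            # Add-one (Laplace) smoothing so unseen tokens do not zero out.
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, code):
        """Return the label with the highest score."""
        scores = self.log_scores(code)
        return max(scores, key=scores.get)
```

In use, one would call `train()` with snippets labelled by when or how they were written, then `predict()` on new code. Even in the best case the output is a probability-flavoured guess, not proof of provenance.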

How am I going to enforce the policy? Code review.

I don’t like large patches anyway. Small changes whose provenance is impossible to identify? I think that falls close to the line-by-line auto-complete suggestions. IIRC the Free Software Foundation decided in the 90s and early 2000s that code changes shorter than ~15 lines don’t require a copyright assignment. At what point do a bunch of small changes add up? :man_shrugging:

If others have good suggestions, I’m open to them.

Well, presumably at some point you just have to rely on people complying with the project’s guidelines in good faith, same as for other things such as people only contributing code they own the copyright to, which is also pretty unpoliceable.

By the way, though I never got round to replying, I basically agree with your point of view, and it aligns with my own personal gut feeling that the acceptable level of AI assistance for something you’re submitting in your own name is broadly the same as the acceptable level of assistance from another person.

Yes - in the academic context, I discussed with my class this term:

  • is it acceptable to ask a friend for help revising a sentence that seems awkward? (seems okay)
  • is it acceptable to ask a friend to write a paragraph for you? (seems problematic)
  • is it acceptable to ask a friend to write the paper for you? (clearly unacceptable)

On the assumption that the vast majority of people behave ethically, I suspect we can come up with reasonable :green_circle: / :stop_sign: guidelines.

I’m working on some AI assistance guidelines and “prompt declarations” for class:

  • what tool(s) you used and versions
  • what your prompt included
  • how you validated the response
  • what you included / discarded / modified
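
As a rough illustration, the four items above could be captured in a small structured record. This is only a sketch: the `PromptDeclaration` name, its field names, and the plain-text output format are all hypothetical and not an existing Avogadro or course convention.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PromptDeclaration:
    """Hypothetical record of AI assistance attached to a submission."""

    tools: list[str]      # tool names and versions
    prompt_summary: str   # what the prompt included
    validation: str       # how the response was validated
    included: list[str] = field(default_factory=list)
    discarded: list[str] = field(default_factory=list)
    modified: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return the names of required fields that were left empty."""
        missing = []
        if not self.tools:
            missing.append("tools")
        if not self.prompt_summary.strip():
            missing.append("prompt_summary")
        if not self.validation.strip():
            missing.append("validation")
        return missing

    def to_text(self) -> str:
        """Render the declaration as plain text, one item per line."""

        def join(items: list[str]) -> str:
            return ", ".join(items) if items else "none"

        return "\n".join(
            [
                f"Tools: {join(self.tools)}",
                f"Prompt: {self.prompt_summary}",
                f"Validation: {self.validation}",
                f"Included: {join(self.included)}",
                f"Discarded: {join(self.discarded)}",
                f"Modified: {join(self.modified)}",
            ]
        )
```

The text from `to_text()` could be pasted into an assignment cover page or a pull request description, and `missing_fields()` gives a quick check that nothing required was skipped.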

There might be some use in this for Avogadro as well - particularly since contributors are included on publications (e.g., there will definitely be an Avogadro v2 manuscript).

I wanted to bring this thread back because I saw an article today by Dan Shapiro on "the five levels from spicy autocomplete to the software factory":

  0. Spicy autocomplete, aka original GitHub Copilot or copying and pasting snippets from ChatGPT.
  1. The coding intern, writing unimportant snippets and boilerplate with full human review.
  2. The junior developer, pair programming with the model but still reviewing every line.
  3. The developer. Most code is generated by AI, and you take on the role of full-time code reviewer.
  4. The engineering team. You’re more of an engineering manager or product/program/project manager. You collaborate on specs and plans, the agents do the work.
  5. The dark software factory, like a factory run by robots where the lights are out because robots don’t need to see.

I think the biggest take-home is that for Avogadro, code needs to be correct, particularly for science-related topics. The force fields, the molecular dynamics, molecular surfaces, orbitals, etc. need substantial human intervention and review.

I’ve been playing with Claude Opus 4.5 because Pitt has a license, and I can say it’s pretty good at levels 0 and 1, particularly boilerplate. But it absolutely fails at more advanced math (because it’s a text engine) and plenty of chemistry.

So I’ll go with a policy of “levels 0 and 1 are probably fine.” For example “help me generate some Qt C++ code for a dialog that shows …”

I might also add:

  • discussing a high-level plan that’s implemented by humans / level 0 seems okay, e.g., “can you help me plan mechanisms to provide a simple security sandbox for scripts? What are some pros and cons of the different approaches?”
  • using various AI tools to help code review (e.g., CodeRabbit seems useful but does not replace human review)
  • using an AI tool to guide debugging / bug fixing – that’s implemented by humans (e.g. “there seems to be a race condition in the properties dialog during vibration, can you suggest some possible solutions?”)

I think these fall into the green / yellow criteria.

I reserve the right to reject contributions that seem like they were generated mostly by an ML tool, and code review will continue, particularly for any chemistry-specific or mathematical components.

Should I write this up into some sort of AGENTS.md file or onto the website?

As a policy, that sets the threshold at about the perfect level IMO.

I suspect a document somewhere on GitHub stands a marginally higher chance of being read by people who would tend to submit policy-breaching contributions than if it were on the website.

As it turns out @erb74, someone did train an AI code classifier (for Python):

“Who is using AI to code? Global diffusion and impact of generative AI”
doi: 10.1126/science.adz9311

We train a neural classifier to spot AI-generated Python functions in over 30 million GitHub commits by 160,097 software developers, tracking how fast, and where, these tools take hold.

It’s not clear to me whether their classifier is available to us, although there’s a large Dryad dataset connected to the article.