GSoC 2018 Project: Molecular Dynamics support for Avogadro 2

badarsh2 · June 17, 2018, 8:40pm

Hello,

I am Adarsh, a senior at the Indian Institute of Technology Madras, and I will be joining Johns Hopkins University this fall for my Master's in Materials Science and Engineering. I am working on bringing support for reading and analysing Molecular Dynamics trajectories to Avogadro as part of Google Summer of Code 2018. I would like to use this opportunity to share my work so far and what I plan to implement over the next couple of weeks.

So far, I have written readers for some of the most common MD trajectory formats:

LAMMPS dump,
trr (GROMACS),
dcd (CHARMM, OpenMM etc.)

Support for reading xyz trajectories already exists.

I will be enhancing the player tool with features such as easy navigation between frames and better support for media export, and I will explore avenues for performance improvements. I will also work on tools for analysing common MD quantities such as differential displacement, energy evolution, and RMSD.
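As a point of reference for the analysis tools, the RMSD of a frame against a reference frame is the square root of the mean squared per-atom displacement. The following is a minimal Python sketch, assuming frames are plain lists of (x, y, z) tuples in the same atom order and skipping the superposition (fitting) step that a full tool would usually apply first. The function name `rmsd` is illustrative and not part of Avogadro.

```python
import math


def rmsd(frame, reference):
    """Root-mean-square deviation between two frames with the same atom order.

    No alignment (translation/rotation fitting) is applied here.
    """
    if len(frame) != len(reference) or not frame:
        raise ValueError("frames must be non-empty and the same length")
    total = 0.0
    for (x, y, z), (rx, ry, rz) in zip(frame, reference):
        # Squared displacement of one atom relative to the reference
        total += (x - rx) ** 2 + (y - ry) ** 2 + (z - rz) ** 2
    return math.sqrt(total / len(frame))


# Two atoms, each shifted by 1.0 along z relative to the reference
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
frame = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(frame, reference))
```

Evaluating this for every frame against the first one gives an RMSD-versus-time curve of the kind such a tool could plot.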

As of now, a couple of bottlenecks have come to mind that need to be addressed:

Many of the trajectories do not provide information about the elements. Some of the formats stick to the literal sense of a trajectory and contain only the positions and the velocities. If the system is small, the Custom Element Mapping tool can be used to assign the elements manually. But to account for systems with a large number of atoms, I plan to include a build option to assign custom elements from a file. For example, when reading a GROMACS trr trajectory, one can use this tool to import a .gro file to get information about the constituent elements. This is also a way to make use of the power of Open Babel’s readers if needed!
Many of the trajectories are huge in size and require sufficient memory to store all of the coordinate sets. I am still not sure how to address this: more intelligent use of file pointers is required, perhaps by dynamically navigating to the timesteps under current focus instead of storing all of them in memory. This also requires me to think about the tradeoffs involved: efficient reading vs efficient storage.

Now that I’ve given a brief overview of the project, I would like to hear from you about:

What else you would like to see with respect to analyzing MD trajectories: trajectory tools, additional analysis tools, media exports etc.
What other trajectory file formats you would like to see support for
What solutions / suggestions you have in mind to address some of the bottlenecks

Looking forward to your valuable comments!

Thanks,

Adarsh

ghutchis · June 17, 2018, 8:47pm

Thanks for the post - seems like you’re making great progress so far!

I think saving sets of images, GIF, AVI would be great as far as media exports, but might take some work.
Being able to plot radial distribution functions is often useful, but hopefully others will have suggestions - my group doesn’t do a lot of MD.
I’d guess that Amber trajectories would be appreciated. There's more here
We probably need a way to connect a trajectory file to a loaded PDB, Mol2, or other ‘topology’ file. Maybe there’s a way users can load the ‘topology’ (elements, bonds, etc.) regularly and then have another command or the player tool to load the trajectory?

Luthaf · June 20, 2018, 2:20pm

Hello!

I found this topic from computational chemistry daily, and I wanted to present to you a library that can help you to improve the molecular dynamics support in Avogadro: chemfiles.

Full disclaimer: I am the main author of chemfiles =)

The idea behind chemfiles is to be an adapter between application code and chemical file formats: instead of having every application implement its own reader/writer code for every single format, the goal is to share the effort and the implementation in a single high quality library that can then be used by multiple applications.

On the technical side, chemfiles is a C++11, BSD licensed library, with bindings for C, Fortran, Python, Julia and Rust. It can currently read 15 different formats, with 4 binary formats (NetCDF, TNG, …) and 11 text formats (XYZ, PDB, …). For the text formats, files compressed with gzip or xz are also supported.

The main difference from Open Babel is that Open Babel is oriented toward cheminformatics, while chemfiles is oriented toward computational chemistry, with better support for trajectories as created by molecular simulation software: binary formats and large files. Other differences include the license (GPL vs BSD) and the complexity of the interface.

If you are interested in using chemfiles in your code, I can help you to integrate it. Send me an email at luthaf (at) luthaf (dot) fr if you want to discuss this privately, or I am happy to answer here.

Many of the trajectories do not provide information about the elements.

The way I deal with this is by having a separate data structure for the topology, and allowing the user to set the topology to use for all steps in a trajectory.
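The separation described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `Topology` and `Trajectory` and the methods `set_topology` and `atoms` are hypothetical and are not the chemfiles or Avogadro API. The topology holds only element symbols, and the frames hold only positions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Topology:
    """Per-atom data that stays constant over the whole trajectory."""
    elements: List[str]


@dataclass
class Trajectory:
    """Coordinate sets only; element data comes from an attached Topology."""
    frames: List[List[Vec3]] = field(default_factory=list)
    topology: Optional[Topology] = None

    def set_topology(self, topology: Topology) -> None:
        # Reject a topology whose atom count disagrees with any stored frame.
        n = len(topology.elements)
        for i, frame in enumerate(self.frames):
            if len(frame) != n:
                raise ValueError(
                    f"frame {i} has {len(frame)} atoms, topology has {n}"
                )
        self.topology = topology

    def atoms(self, step: int):
        """Pair each position in the given step with its element symbol."""
        if self.topology is None:
            raise RuntimeError("no topology set")
        return list(zip(self.topology.elements, self.frames[step]))


traj = Trajectory(frames=[
    [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.97, 0.0, 0.0)],
])
traj.set_topology(Topology(elements=["O", "H"]))
print(traj.atoms(1))
```

Because the topology is stored once rather than per frame, it can be replaced later (for example after loading a separate structure file) without touching the coordinate data.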

Many of the trajectories are huge in size and require sufficient memory to store all of the coordinate sets.

I went the other way around: text formats store the file offsets of the beginning of each step, which allows quick access to a given step when requested. This means that there is no need to load the whole trajectory in memory. But the tradeoff might be different for a visualization application, where you might want to have all the steps available.

ghutchis · June 20, 2018, 2:58pm

I can’t speak for @badarsh2 but we’d certainly love to have integration with chemfiles.

One of the big changes in Avogadro2 was the intent to help bring multiple packages together - not just integration with Open Babel. Clearly for trajectories, there’s a lot more support in chemfiles already. So I’ll let you and Adarsh coordinate on implementations.

The way I deal with this is by having a separate data structure for the topology, and allowing the user to set the topology to use for all steps in a trajectory.

Yes, agreed. I think this is something needed for the project – a way in the interface to load the topology and trajectory separately. (@badarsh2 - I’d consider that users should load the topology first and trajectory with another command or the player tool… but maybe also a command that allows the topology to be ‘replaced’ if the opposite happens accidentally.)

I went the other way around: text formats store the file offsets of the beginning of each step, which allows quick access to a given step when requested. This means that there is no need to load the whole trajectory in memory.

You mention a text file, but perhaps it would suffice to keep a vector of file offsets for each step in memory? That would provide faster access than a text file with offsets, but avoid loading the entire trajectory in memory?

Luthaf · June 20, 2018, 3:21pm

Sorry, I was not clear. I am indeed storing the file offsets (std::streamoff) in a vector in memory, and not writing a file containing the offsets. And I definitely agree that this should be faster than writing an offset file =).

But I still need to read through the file once (without parsing everything) to locate these offsets when loading the file.
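The offset-index idea can be illustrated with a short Python sketch for the XYZ trajectory format (atom count line, comment line, then one line per atom). This is not the chemfiles implementation; the function names `index_xyz_frames` and `read_xyz_frame` are hypothetical, and the sketch assumes a well-formed file with no blank lines between frames.

```python
import os
import tempfile


def index_xyz_frames(path):
    """Scan an XYZ trajectory once and return the byte offset of each frame.

    Only the atom-count line of every frame is parsed; the comment and
    coordinate lines are skipped without being interpreted.
    """
    offsets = []
    with open(path, "rb") as f:
        while True:
            start = f.tell()
            header = f.readline()
            if not header.strip():
                break  # end of file (or a trailing blank line)
            n_atoms = int(header.decode("ascii"))
            offsets.append(start)
            for _ in range(n_atoms + 1):  # comment line + one line per atom
                f.readline()
    return offsets


def read_xyz_frame(path, offsets, step):
    """Seek directly to one frame and parse only that frame."""
    with open(path, "rb") as f:
        f.seek(offsets[step])
        n_atoms = int(f.readline().decode("ascii"))
        f.readline()  # comment line
        atoms = []
        for _ in range(n_atoms):
            symbol, x, y, z = f.readline().decode("ascii").split()[:4]
            atoms.append((symbol, float(x), float(y), float(z)))
    return atoms


# Small two-frame trajectory written to a temporary file for the demo
data = (
    b"2\nstep 0\nO 0.0 0.0 0.0\nH 0.96 0.0 0.0\n"
    b"2\nstep 1\nO 0.0 0.0 0.1\nH 0.97 0.0 0.1\n"
)
with tempfile.NamedTemporaryFile("wb", suffix=".xyz", delete=False) as tmp:
    tmp.write(data)
    path = tmp.name

offsets = index_xyz_frames(path)
last = read_xyz_frame(path, offsets, 1)
os.remove(path)
print(offsets, last)
```

The index costs one integer per frame, so memory use grows with the number of steps rather than with the number of atoms times steps. The price is the single pass over the file at load time mentioned above.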

mhanwell · June 21, 2018, 3:42pm

One other question, I have not read through the code, but how do you deal with reading trajectories into our memory layout efficiently? Do we have an intermediary representation from chemfiles that is then translated to our in-memory representation, or have you taken some kind of templated approach? We have a meta-reader for Open Babel (which is GPLv2 only), and defer to it for all formats supported. In that case we have it translate to a format we can read/write efficiently, but there is obvious overhead in the translation to a common format.

Definitely interested, and would like to understand the overhead compared to a reader that parses and reads directly into our memory layout.

Luthaf · June 22, 2018, 9:30am

Yes, chemfiles defines an intermediate representation (the Frame class) that Avogadro would need to convert to/from.

Depending on the memory layout used in Avogadro, one could keep the frame alive and take pointers inside the frame to access the positions/velocities without any copies. Doing the same might be harder for the topology.
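As a loose analogy to taking pointers inside a live frame, the Python sketch below exposes a frame's coordinate storage through a `memoryview`, which shares memory with the owner instead of copying it. The class `FrameBuffer` is a hypothetical stand-in and not the chemfiles `Frame` class; the flat x, y, z, x, y, z, ... layout is also an assumption.

```python
from array import array


class FrameBuffer:
    """Hypothetical stand-in for a frame that owns a flat x,y,z,... buffer."""

    def __init__(self, coords):
        self._data = array("d", coords)

    def positions(self):
        # A memoryview shares the underlying memory; nothing is copied.
        return memoryview(self._data)


frame = FrameBuffer([0.0, 0.0, 0.0, 1.0, 2.0, 3.0])
view = frame.positions()
second_atom = view[3:6]  # slicing a memoryview is also copy-free
view[3] = 1.5            # writes through to the frame's own storage
print(list(second_atom), frame._data[3])
```

The view is only valid while the owning buffer stays alive and keeps its size, which mirrors the requirement to keep the frame alive for as long as the application holds pointers into it.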

Having some kind of no-copy API for integrating frames with other codes might be a good idea, though it will require a bit of design work.