Research, Cheminformatics or Statistics to enhance functionality of Avogadro

Can we use research methods or models to enhance Avogadro's features?
For example, consider the following features (I'm not sure whether any of them are already part of Avogadro, so I'm just suggesting ideas):

  • Database Integration
    We could integrate Avogadro with chemical databases (e.g., PubChem, ChEMBL) to fetch molecular structures, properties, and activities.

  • Separate Research Section
    A separate research section where researchers can show how they have used Avogadro in their own work. Newcomers could also learn to use Avogadro more efficiently, as mentioned here.

  • Descriptor Calculation
    We could implement descriptor calculations (e.g., molecular weight, LogP) using cheminformatics and allied libraries to give users more detailed information about molecular properties.

  • Similarity Search
    We could help users find compounds with similar structures or properties.

  • Property Prediction
    We could train machine learning models to predict molecular properties (e.g., solubility, toxicity) based on structural features.

  • QSAR models
    I recently heard about QSAR models, which can be used to predict the biological activity of chemical structures, so we could try to implement this as well.

  • Molecular Conformation Analysis
    We could implement ML models so that the user uploads or draws a molecular structure in Avogadro, and Avogadro returns the predicted energetically favorable conformations of that molecule.
    This could help in analyzing the molecule's spatial arrangement.

I'm just curious whether we can implement any of these in Avogadro. These are just some use cases and suggestions I could think of; I'm not sure whether we want to, or can, do them.

Thank You

I've had an open feature request tracking this for a while. It would be useful to identify common features so that some could be turned into Python scripts (e.g. “search for query” => properties, “download molecule”, etc.)

Some of this is already built in (molecular weight), and other properties can already be calculated with scripts.

Same. I think a few people already are working on command scripts for this.

So, can I work on integrating the databases? Do you have any references I could follow to integrate some of the databases mentioned? For example, we could use the API for the Materials Project database, but in which part of the codebase should I implement it?

Could you tell me more about this? I want to make sure I fully understand, so please elaborate a bit or provide references if there are any.

Thanks

At the moment, it would be a bit like ImportPQR (avogadrolibs/avogadro/qtplugins/importpqr in the OpenChemistry/avogadrolibs repository on GitHub).

But I think the better thing would be to consider whether “access a remote database” can be a general thing. For example:

  • What is the URL to access the API?
  • What is the URL to get results for a query (e.g. for PubChem, it might be PUG REST or PubChemPy)?
  • A method to turn results from the query into a CSV / TSV / dataframe to display to the user (e.g., what are the columns for properties?)
  • A method to get the URL to download a particular entry / structure
  • Maybe there are some other example queries / needed functions? PQR offers preview images, which are nice.

I haven’t looked at the Materials Project, but between PubChem, ChEMBL and Materials Project, it would be useful to see if we can find similarities so we’re not writing custom code for each database.

In other words, for database integration, I’d like to see if we can design general code that just needs a small amount of information to access a few different databases.
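
To make the "small amount of information per database" idea concrete, here is a minimal sketch in Python. The `RemoteDatabase` class and its field names are hypothetical (they are not part of Avogadro), and the PubChem paths follow the PUG REST URL patterns discussed in this thread. Other databases would need their own templates, which I have not checked.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class RemoteDatabase:
    """Hypothetical description of one remote database (not an Avogadro API)."""
    name: str
    base_url: str      # root of the REST API
    search_path: str   # template: text query -> matching record IDs
    record_path: str   # template: record ID -> full record
    preview_path: str  # template: record ID -> preview image

    def search_url(self, query: str) -> str:
        # Percent-encode the query so spaces and symbols are URL-safe.
        return self.base_url + self.search_path.format(query=quote(query, safe=""))

    def record_url(self, record_id: int) -> str:
        return self.base_url + self.record_path.format(id=record_id)

    def preview_url(self, record_id: int) -> str:
        return self.base_url + self.preview_path.format(id=record_id)


# PubChem entry; a second database would only need another instance like this.
PUBCHEM = RemoteDatabase(
    name="PubChem",
    base_url="https://pubchem.ncbi.nlm.nih.gov/rest/pug",
    search_path="/compound/name/{query}/cids/TXT",
    record_path="/compound/cid/{id}/json",
    preview_path="/compound/cid/{id}/PNG",
)
```

With this shape, the generic code only ever calls `search_url`, `record_url`, and `preview_url`, and the per-database part is reduced to a handful of strings.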

Hello,
Sorry for the late reply. Yesterday I was familiarizing myself with the PubChem and ChEMBL databases so that we could find some common ground for integration.

Could you tell me whether we plan to integrate only certain resources from the databases, or the whole database?
I ask because ChEMBL in particular has different columns or fields for different resources. If we integrate the whole database, we will have to match the resources of PubChem and ChEMBL (ChEMBL perhaps supports more resources than PubChem) to find common fields.

Also, how would the user ask for compounds or resources? That is, would searching use IDs, names, or structures (structure search is available in both databases, but I think only for certain compounds)?

Yes, I think this could easily be done for both databases: the PubChem API supports CSV and JSON, whereas ChEMBL supports JSON, which we could convert to a dataframe using pandas.
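
As a rough illustration of the "results into a CSV" step without any extra dependency, here is a sketch using only the standard library. The function name `rows_to_csv`, the column names, and the two rows are placeholders I made up; they are not real query results from either database.

```python
import csv
import io


def rows_to_csv(rows, columns):
    """Render a list of property dicts as CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        restval="",             # leave the cell empty when a field is missing
        extrasaction="ignore",  # drop fields that are not shared columns
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder rows: "db_a" and "db_b" stand in for two databases whose
# records do not expose exactly the same fields.
rows = [
    {"source": "db_a", "name": "example", "formula": "C2H6O", "logP": -0.1},
    {"source": "db_b", "name": "example", "formula": "C2H6O", "extra": "x"},
]
table = rows_to_csv(rows, ["source", "name", "formula", "logP"])
```

The same list of dicts could be handed to pandas instead; the point is only that a fixed list of shared columns hides the differences between databases.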

Can we go stepwise? I felt that finding similarities across such complex databases would be more tedious than writing custom code for each; custom code would be time-consuming but easier. (Actually, it was getting overwhelming for me :sweat_smile: :sweat_smile: :grimacing: but yes, we could definitely try to find similarities.)
(Also, could we have a chatbot or an interface through which the user can query different databases with the same query or search?)

Sure. Let’s take PubChem => JSON first.

First step (always move stepwise) … imagine the user searches for “ibuprofen” or “aspirin.” What is the query URL that Avogadro should use to get results from the database?

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/ibuprofen/json

Most important: CID: 3672

We have a bunch of properties we might put into a table:

  • name
  • mass
  • formula
  • logP
  • polar surface area
    … etc.

So we might want a result table that shows these as columns to the user.
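
Here is a minimal sketch of turning the JSON record into table columns. It assumes the record has the `PC_Compounds` / `props` / `urn` / `value` layout that the full-record endpoint returns; the `sample` dict is a hand-trimmed illustration of that layout with illustrative values, not a complete live response, and `extract_properties` is a name I made up.

```python
def extract_properties(record):
    """Flatten one PUG REST compound record into {column name: value}."""
    compound = record["PC_Compounds"][0]
    result = {"CID": compound["id"]["id"]["cid"]}
    for prop in compound.get("props", []):
        urn = prop["urn"]
        # Several properties can share a label, so append the optional
        # "name" qualifier to keep the column names distinct.
        key = urn["label"]
        if "name" in urn:
            key = f'{key} ({urn["name"]})'
        value = prop["value"]
        # Keep string, float, and integer values; skip anything else
        # (e.g. binary fingerprints), which is not useful in a table.
        for typed_key in ("sval", "fval", "ival"):
            if typed_key in value:
                result[key] = value[typed_key]
                break
    return result


# Hand-trimmed sample in the assumed shape (illustrative values only).
sample = {
    "PC_Compounds": [{
        "id": {"id": {"cid": 3672}},
        "props": [
            {"urn": {"label": "Molecular Formula"},
             "value": {"sval": "C13H18O2"}},
            {"urn": {"label": "Log P", "name": "XLogP3"},
             "value": {"fval": 3.5}},
            {"urn": {"label": "Fingerprint", "name": "SubStructure Keys"},
             "value": {"binary": "00"}},
        ],
    }]
}
props = extract_properties(sample)
```

Each key of the returned dict can then become one column of the result table.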

Second step - previews:

This might be a nice thing: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/glucose/PNG


Third Step - what if we get multiple entries:

PUG suggests there are multiple glucose entries (this seems to have been cleaned up)

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/glucose/cids/TXT

We can then iterate through multiple matches to get properties:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/5793/json

I’m not currently able to find a search that returns multiple CIDs … can you?
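
The iteration step could be sketched like this, and it works the same whether the search returns one CID or several. It assumes the `cids/TXT` response body is plain text with one integer CID per line; the function names are hypothetical.

```python
PUG_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def parse_cid_list(text):
    """Parse a cids/TXT response body into a list of integer CIDs."""
    # split() with no arguments handles newlines, blank lines, and
    # trailing whitespace in one go.
    return [int(token) for token in text.split()]


def record_urls(cids):
    """Build the full-record JSON URL for each CID."""
    return [f"{PUG_BASE}/compound/cid/{cid}/json" for cid in cids]
```

For example, `record_urls(parse_cid_list("5793\n"))` gives a one-element list containing the glucose record URL shown above.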

Hi, just confirming a few things.

So currently we are focusing on integrating only PubChem compounds.

Okay, so we would be querying the database using the name of the compound, right?

PubChem has probably fixed its API without updating its docs.
For example, have a look at these (and this link for more info):

Although the two have different CIDs but similar names, querying the API URL for either name fortunately returns the same CID:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/Enterocura/cids
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/Sulfaguanol/cids

So I think we can query the PubChem database using names.
Now, tell me how I should start: should we have a Python script, or something similar to ImportPQR in C++?

Thanks!

I think my first question is whether the current PubChem REST API will ever return multiple CIDs for a name search. The docs say yes, but it seems as if it’s now only returning one standardized reply. For example, in your search, the first entry (CID 65756) has unspecified stereochemistry, while a search for any of the names now only returns CID 9571041.

Personally, I’d start with Python. As I said, let’s think about how to parse the JSON results into a set of properties.

I’d also think about other possible searches. Is there a good way to detect when a user enters a SMILES instead of a compound name?

  • Certainly seeing “#” or “=” in the query, also “@” or “@@” … any other obvious clues?
  • Compound names can include [ or ] ( ) and numbers … also a few special characters
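
A rough sketch of that heuristic, using only the clues listed above. The function name is hypothetical, and this is deliberately not foolproof: a SMILES string with none of the marker characters (e.g. `CCO`) is treated as a name.

```python
SMILES_MARKERS = ("#", "=", "@")


def looks_like_smiles(query):
    """Rough guess: True if the query contains a character typical of SMILES."""
    text = query.strip()
    # A single SMILES string has no internal whitespace, while many
    # compound names do, so whitespace rules SMILES out early.
    if not text or any(ch.isspace() for ch in text):
        return False
    # Brackets, parentheses, and digits are not checked because
    # compound names can contain them too.
    return any(marker in text for marker in SMILES_MARKERS)
```

So `looks_like_smiles("CC(=O)O")` and `looks_like_smiles("C#N")` are true, while `"ibuprofen"`, `"acetic acid"`, and (wrongly) `"CCO"` are classified as names.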

Hello,

Can we use RDKit to validate SMILES strings, or could we use a regex (which may or may not be foolproof)?
(Or we could just let the user choose the input method, i.e. SMILES or compound name, but here we might need a separate Python script in order to render separate options.)
For example:

def getOptions(opts):
    # Let the user pick the kind of query instead of guessing it
    userOptions = {}
    userOptions['input_type'] = {}
    userOptions['input_type']['label'] = 'Choose SMILES or compound name'
    userOptions['input_type']['type'] = 'string'
    userOptions['input_type']['values'] = ['SMILE', 'compound name']
    userOptions['input_type']['default'] = 'compound name'

    if input_type == 'SMILE':......
    elif .........

Yeah, I’m kinda leaning towards that. It would be nice to have a “smart” search bar, but I think that’s going to be tricky to be foolproof at identifying SMILES. As a human you know it, but it’s probably better to have a way of selecting different queries.

Again, I guess my question is whether the current PubChem API would return multiple CIDs matching a search. Most of the common searches seem to just return one.

I don’t think that should happen, actually. I even tried querying the above examples using a SMILES string and only one CID is returned. Here it also says a single result would be returned.

Merry Christmas!
Happy and Prosperous New Year to all!

Is it compulsory for us to use the PubChem API? I discovered that property and image retrieval become very easy with the PubChemPy library. I felt I had made things unnecessarily complicated.
Have a look at this, for example, for SMILES:

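    # Needs: import base64, json, requests; import pubchempy as pc
    if input_type == 'SMILE':
        properties = pc.get_properties(identifier=search_compound, namespace='smiles', properties=props)
        if not properties:
            return "The input record was not found"
        try:
            preview = requests.get(base_url + '/smiles/' + search_compound + '/PNG')
            preview.raise_for_status()  # treat HTTP errors (e.g. 404) as "no image"
            # Raw PNG bytes are not JSON serializable, so base64-encode them first
            encoded_preview = base64.b64encode(preview.content).decode('utf-8')
            json_data = json.dumps(encoded_preview)
            properties.append({'preview': json_data})
        except requests.RequestException:
            properties.append({'preview': "Image Not Found"})
        return properties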

We could also try to implement substructure or superstructure search using the library.
I just wanted your confirmation on the image or preview field: if we do not encode it, the terminal reports that it's not JSON serializable. So is the above code okay, or is there another approach for the preview field?

@ghutchis I know you are off for a few days, but when you resume I will be a bit occupied with my end-of-semester exams (until 22nd Jan), hence I posted this today.
Kindly answer at your convenience when you resume. Till then, HAPPY HOLIDAYS!

Thanks

I think for previews, I’d just pass a URL to Avogadro and have it fetch / handle the image. That way it also knows the image type.
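
In other words, something like this sketch, where the script only builds the preview URL and never downloads or encodes the image itself. The helper name and the `'preview'` key are placeholders, not an agreed interface.

```python
from urllib.parse import quote

PUG_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def preview_url(identifier, namespace="name"):
    """Return the PUG REST PNG URL for a compound name, CID, etc."""
    # Percent-encode the identifier so spaces and symbols are URL-safe.
    encoded = quote(str(identifier), safe="")
    return f"{PUG_BASE}/compound/{namespace}/{encoded}/PNG"


# A URL is a plain string, so it is JSON serializable as-is.
entry = {"name": "glucose", "preview": preview_url("glucose")}
```

This also removes the base64 step, since nothing binary ends up in the JSON.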

Have a look at this script draft. Please give me feedback; I’ll make the required changes and send a PR soon if everything is fine.
Thanks

Hello,
Feels great to be back :grinning:
@ghutchis Could you please let me know what changes are required for the PubChem integration plugin, so that I can send a PR to the Avogadro repo and proceed with other databases?

Also, could you please review the rotation and reflection script? The issue has been open for quite a long time. I can get on with the other transformation types mentioned here if they are suitable.
Thanks

Hello,
Is there any way to ask the user for input twice within a Python plugin?
This is because compound names in the ChEMBL database aren't unique, so the plugin would return multiple results (multiple different compounds).
So I am thinking of returning all of the relevant compounds to the user, who can then choose which compound's properties are required.
I don't think it's possible to integrate APIs (Django or Flask) for this purpose, is it?

Also, is there any way for the user to load that specific compound directly into Avogadro's window?
In the PubChem plugin I added an option to download the SDF file of the compound, but I feel it would be nicer if the user could have it directly in the Avogadro window. What are your views on this?