Feel free to ask anything
Just type your information below
It's totally free
Graph Neural Networks for
Binding Affinity Prediction

Content
  1. Intro
  2. Binding Affinity
  3. Virtual Screening
  4. Ligand parameterization
  5. Receptor parameterization
  6. Graph neural networks
  7. Example of implementation
  8. Summary
1. Intro

The topic relates to the application of AI and bioinformatics in drug discovery. The core purposes of the proposed AI technology [1] are:

  1. Cut expenses and duration of early drug discovery phases (target discovery and validation, lead identification and optimization). The phases are marked in red on the image below.
  2. Extend virtual screening capabilities and accuracy by rejecting dubious molecular coordinates and transition to a more efficient parameterization of (bio)molecules.
To get a context of the problem and solution, let's first consider the main concepts.
2. Binding Affinity

2.1 What is binding affinity?

Binding affinity is the strength of the binding interaction between a single biomolecule (e.g. protein or DNA) to its ligand/binding partner (e.g. drug or inhibitor).

Binding affinity is typically measured and reported by the equilibrium inhibition constant (Ki), which is used to evaluate and rank order strengths of biomolecular interactions. The smaller the Ki value, the greater the binding affinity of the ligand for its target. The larger the Ki value, the more weakly the target molecule and ligand are attracted to and bind to one another.

Binding affinity is influenced by non-covalent intermolecular interactions such as hydrogen bonding, electrostatic interactions, hydrophobic and Van der Waals forces between the two molecules. In addition, the binding affinity between a ligand and its target molecule may be affected by the presence of other molecules.

2.2 Why is it important?

Whenever you are characterizing proteins, nucleic acids, and any biomolecule, understanding the binding affinity to substrates, inhibitors, and cofactors is key to the appreciation of the intermolecular interactions driving biological processes, structural biology, and structure-function relationships.

In drug discovery, binding affinity is used to rank hits binding to the target and design drugs that bind their targets selectively.

"Selectively" means the drug must have a high affinity to the selected target and the lowest possible affinities to other targets to avoid off-target binding and caused side effects.

2.3 How to measure it?

There are many experimental ways to measure binding affinity and inhibition constants, such as ELISAs, gel-shift assays, pull-down assays, equilibrium dialysis, analytical ultracentrifugation, surface plasmon resonance, and spectroscopic assays.

Experimental methods are expensive in terms of required human efforts, time, and resources. Due to the tremendous number of chemical compounds, experimental bioactivity screening efforts require the aid of computational approaches. A set of such approaches is called virtual screening.

3. Virtual Screening

Let's give a definition and a brief classification for the concept:

Virtual screening is a set of computational techniques for the selection of molecules that are most likely to bind to a drug target (protein or polynucleotide).

There are two main branches of virtual screening:

3.1 Ligand-based

The 3D structure of the target is unknown. A set of geometric rules and/or physical-chemical properties (known as pharmacophore model) obtained by QSAR studies are used for the screening.

3.2 Structure-based

The 3D structure of the target is known. Target coordinates and scoring function are used to calculate the affinity of the target to ligands. Also known as molecular docking (see the gif below).

Docking of a small molecule (green) into the crystal structure of the beta-2 adrenergic G-protein coupled receptor (PDB: 3SN6)
Example of a legacy virtual screening method failure for affinity evaluation

Conventional structure-based virtual screening methods are AutoDock, AutoDock Vina, Vina with modern scoring functions like RF [4].

Recently, our group tried to separate known thrombin active and inactive ligands using AutoDock, AutoDock Vina, and Vina with RF Score. Selected active and inactive molecules were established experimentally (i.e., their activity was known in advance; we rather tested virtual screening methods correspondence to reality). All ligands are available at the DUDE database web interface.

As a result, AutoDock, AutoDock Vina, and Vina RF were not able to separate active (blue bars) and inactive (red bars) ligands even with a properly set binding site.


Graph neural networks as a virtual screening tool

Within the current context, the proposed AI technique is classified as a kind of virtual screening. Let's review how graph neural networks are used for the parameterization of molecules.

4. Ligand parameterization

Ligands are small molecules usually. Thus, the atom/chemical bond scale is appropriate. For large molecules, such as peptides, refer to the next chapter.

4.1 Atom parameterization


Each atom of a ligand is featured with mass, total and partial charges, number of radical electrons (integers); atom type, valence, hybridization, aromaticity, chirality types (one-hot encodings), etc. Atomic coordinates may be used as a feature or not (especially if unknown). Atom features are shown as a vector with three circles in the image below.

4.2 Bond parameterization

Each bond of a ligand is featured with a type (single, double, triple, aromatic), ring affiliation, whether a bond is conjugated (0/1), stereo configuration (cis-/trans-, E/Z, S/R, none), direction (upright/downright) of a bond, etc. Bond features are shown as a vector with two circles in the image below.

4.3 Intramolecular forces

Intramolecular = inside a molecule. Learned by attention mechanism. The mechanism by design should capture electronic density distribution effects within a molecule. Recursively trained attention context vector aggregates information about local neighborhoods to provide expressive representations of small molecules. In higher time steps, target node embedding will include information from further nodes recursively. A more intense (smaller) yellow circle implicates a higher attention level of the neighbor node.

5. Receptor parameterization

Receptors are typically large biomolecules: polynucleotides (DNA, RNA) or proteins. There are different receptor parameterization techniques for graph neural networks. Let's review some of them.

5.1 As a linear graph

A receptor can be represented as a linked list (linear graph) of amino acids (for proteins and peptides) or nucleotides (for DNA, RNA). A node, in this case, is one amino acid or one nucleotide. A node can be parameterized with such features as charge, flexibility (Smith), hydrogen bond donors/acceptors, hydrophobicity, polarity (Zimmerman), Van-der-Waals volume, etc. Edges are mostly the same (peptide bonds for proteins/peptides and alternating sugar-phosphate backbone along the polynucleotide chain) and do not require parameterization. I can only hope that kind of attention mechanism would be able to simulate/learn the secondary/tertiary structure of these biomolecules. Cartesian coordinates are not required.

5.2 As a graph of intermolecular interfaces

A protein-ligand graph is computed from the atomic coordinates in a PDB file. In this graph, vertices represent secondary structure elements (usually alpha helices and beta strands) or ligand molecules while the edges represent contacts and relative orientations between them. Atomic coordinates are a must for the case.
Computation of the protein graph from 3D atom data

5.3 As adjacency matrix
Interatomic (or inter-amino-acid, or inter-nucleic-acid) distances matrix is constructed from the Cartesian atomic coordinates. In these matrices, the rows and columns are assigned to the nodes in the network and the presence of an edge is symbolized by a numerical value.
Graphs by edge type and their adjacency matrices

It is possible to build undirected and directed neighbor matrices without Cartesian coordinates, but they are mandatory to create weighted adjacency matrices where the distance between nodes is taken into account. For weighted molecular adjacency matrices, the radial interaction cutoff (typically 12 Angstrom) is used to truncate nodes that do not interact.

5.4 As a raw FASTA string
Applicable only for NLP models, such as Transformers, GRU, LSTM, etc. In this case, raw receptor FASTA string is used, parameterization is made by the model implicitly by learning embeddings, for example. Coordinates are not required in this case.
Node and edge featurization define the type and architecture of the receiver graph neural network.

6. Graph neural networks

This section provides a short introduction to graph neural networks (GNN) for molecular properties prediction and outlines their categorization.

6.1 Convolutional graph neural network (Conv-GNN)

Convolutional neural networks (CNNs) are networks specialized for interacting with grid-like data, such as a 2D image. As molecules are typically not represented as 2D grids, chemists have focused on a variant of this approach: the Conv-GNN on molecular graphs.

Molecular graphs confer key advantages: they bypass the conformational challenge of using 3D representations while maintaining invariance to rotation and translation due to their pairwise definition. The MoleculeNet paper (Wu et al., 2017) offers a concise conceptual comparison of six major variants. To facilitate the following explanation, the framework of neural message-passing networks put forth (Gilmer et al., 2017) is used.

Neural message passing networks utilize a convolutional layer, simply a matrix of scalar weights, to exchange information between atoms or bonds within a molecule and produce a fixed-length, real-valued vector that embeds the molecular information. To begin, they generate or compute a feature vector for each atom within the molecule. These feature vectors are then collected into a matrix. Additionally, they generate a graph topology matrix that specifies the connectivity of the graph. In a forward convolutional pass, these three matrices are multiplied together. This allows information to be exchanged between the feature vectors of each atom with its immediate neighbors, in accordance with the connectivity specified by the topology matrix. This updates each atom's feature vector to include information about its local environment. This updated feature vector matrix is then passed through an activation function (i.e., ReLU) and can then be iteratively updated by using it as the feature matrix in another convolutional pass. This propagates information throughout the molecule. Finally, these atom feature vectors are either summed or concatenated to give a unique, learned representation of the molecule as a real-valued vector (see the figure below).

The learned representation in vector form is referred to as a representation in latent space and is then used as the input for a traditional fully connected DNN to finally make the classification or prediction. Backpropagation is once again used to train these networks by propagating gradients backward and determining how to change the convolution matrix weights and the parameters in the DNN.

Examples: GAT, GCN, AGNN, SchNet, MPNN

6.2 Recurrent graph neural network (Rec-GNN)

Recurrent neural networks (RNNs), introduced by Hopfield in 1982, are specialized for dealing with sequences of arbitrary length. This makes them ideally suited to handling the textual representation of chemical information, such as FASTA and SMILES. The critical difference is that in the previous architectures each data input is distinct, while in an RNN each input will influence the next one. An illustrative example is viewing any particular input, such as a SMILES string, as time-series data. The presence of a carbon atom at one moment in time influences what the next character is likely to be. This is expressed in the architecture by feeding the output of the hidden layer for that carbon into the hidden layer of the next atom. The feeding of one hidden state into the next gives the system a recursive relationship within the hidden layer, but it can be viewed as directional by "unfolding" the network to form an unfolded, acyclic network graph. By doing this, it maintains a history of all previous inputs, and they influence its prediction at a later time. The network can then be trained using a recursive form of backpropagation.

Examples: AttentiveFP, MGCN, GPNN

Illustration of ACNN and RNN architectures for chemical applications. Colored arrows stemming from the amine group (ACNN) indicate the information transfer from the nitrogen to other heavy atoms, with the color corresponding to the convolutional pass. Light gray arrows indicate each atom's feature vector in the matrix. Importantly, properties such as the atomic number (Z) are often encoded using one-hot vectors, which are binary, but for spatial efficiency, the integer is used in its place. The RNN model shows a simplified "many to one" recurrent network, with the text above and below the dashed lines indicating a stylized affinity prediction system. This system takes in receptor FASTA and ligand SMILES strings and predicts the affinity value.
Difference between recurrent (Rec-GNN) and convolution (Conv-GNN) based graph neural networks: Rec-GNNs apply the same set of weights (W1=W2) until a convergence criterion is met, whereas Conv-GNNs apply different weights at each iteration.

6.3 Special

These GNN architectures have several additions to the basic GNNs (Conv-GNN and Rec-GNN) like different pooling strategies, skip-connections, attention mechanisms, super-nodes, isomorphic graphs.

Examples: Weisfeiler-Lehman, Graph-Attention Transformers

7. Example of implementation

As an example, we will train Atomic Convolutional Neural Network (ACNN) by Gomes et al., 2017. The dataset is PDBBind 2015 — it contains three subsets: core (195 structures), refined (3,706 structures), and full (14,260). The target is to predict ∆G — free energy of receptor-ligand complexation, which serves as a binding affinity metric.

Diagram of Atomic Convolutions on Protein-Ligand Systems. ∆G (complex) = G (complex) − G (protein) − G (ligand)
7.1 Select configuration

See the full list of available configurations here.

idx = 0
choices=[...]
args.update(get_exp_configure(choices[idx]))


7.2 Prepare train and test datasets

dataset, train_set, test_set = load_dataset(args)
args['train_mean'] = train_set.labels_mean.to(args['device'])
args['train_std'] = train_set.labels_std.to(args['device'])
train_loader = DataLoader(dataset=train_set,
batch_size=args['batch_size'],
shuffle=False,
collate_fn=collate)
test_loader = DataLoader(dataset=test_set,
batch_size=args['batch_size'],
shuffle=True,
collate_fn=collate)


7.3 Load ACNN model, initialize loss function, optimizer, and early stopping callback

model = load_model(args)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=args['lr'])
stopper = EarlyStopping(mode=args['mode'],
patience=args['patience'],
filename=choices[idx]+'_model.h5')


7.4 Train until early stopping

for epoch in range(args['num_epochs']):
run_a_train_epoch(args, epoch, model, train_loader, loss_fn, optimizer)
test_scores = run_an_eval_epoch(args, model, test_loader)
test_msg = update_msg_from_scores('test results', test_scores)
early_stop = stopper.step(test_scores['mae'], model)

if early_stop:
print('Early stopping')
break


7.5 Results and discussion

With the default ( ACNN_PDBBind_core_pocket_random) configuration, ACNN achieves 1.5006 MAE, 0.1461 R2.

epoch 12/500, training | loss 0.7162, r2 0.2815, mae 1.6027
test results, r2 0.1461, mae 1.5006
In terms of the task, 1.5006 MAE means that the mean average error of the model is ~1.5 kcal/mol of Gibbs free energy of binding affinity. It is precise enough, compared to the energy of water-water hydrogen bond O−H···:O (21 kJ/mol or 5.0 kcal/mol).

See the full implementation on GitHub.

8. Summary

1. Challenges. Predicting drug-target interactions is crucial for novel drug discovery, drug repurposing, and uncovering off-target effects. Experimental bioactivity screening takes significant time (1–3 years) and expense (more than 100 million USD on average per new drug-on-market) but has low efficiency. Bioassays are typically backed by computational methods, but legacy simulations fail to deliver either sufficient precision — like in the example of AutoDock Vina with modern RF Score which failed to separate active and inactive thrombin ligands — or sufficient speed — like in the example of molecular dynamics or first-principle quantum mechanics simulations. As a result, more than 90% of the proposed leads are declined (He et al., 2017).

2. Solution. In silico methods are highly demanded since they can expedite the drug development process by systemically suggesting a new set of candidate molecules promptly, which saves time and reduces the cost of the whole process by up to 43% (DiMasi et al., 2016). Graph neural networks deliver superior accuracy for the task in a matter of milliseconds per receptor-ligand pair and extend docking capabilities by accepting structures without coordinates.

3. Results. The ACNN model easily achieves ~1.5 kcal/mol MAE predicting Gibbs free energy of binding affinity. This repository contains the full implementation.

4. Philosophy. Bio-/cheminformatics is now on the edge of a similar paradigm shift as computer vision before the deep learning model AlexNet won the 2012 ImageNet contest. Instead of selecting manually crafted features for molecules, integrative features are learned by optimization methods.

References
  1. Gurbych O., Druchok M., Yarish D., Garkot S., "High throughput screening with machine learning", presented at NeurIPS 2020
Made on
Tilda