Geometric Deep Learning for Drug–Target Interaction — Nucleate BioHack 2025
Built by Jeremy Wayland during the Nucleate BioHack (Novartis Challenge). This post outlines the dataset and paper we followed, our graph neural network approach, how we integrated geometric and molecular features, links to the code, and an interactive compound visualization.
Paper and Dataset — Context
- We ingested the Novartis challenge SMILES library, sanitized the molecules with RDKit, stripped to the largest fragment, and optionally expanded with explicit hydrogens before graph construction.
- Each graph stores a gene-expression vector of log fold changes in `data.y`, plus optional per-sample metadata (dose, platform, etc.) in `data.mol_features`, allowing us to condition predictions on experiment context.
- The preprocessing utilities live in `gixnn/molecular_features.py` and emit ready-to-batch `torch_geometric.data.Data` objects for training and evaluation.
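The cleaning steps above (sanitize, keep the largest fragment, optionally add explicit hydrogens) can be sketched with plain RDKit calls. This is a minimal illustration, not the actual `gixnn` helper; `clean_smiles` is a hypothetical name:

```python
from rdkit import Chem

def clean_smiles(smiles: str, add_hs: bool = False):
    """Parse and sanitize a SMILES string, keep the largest fragment,
    and optionally add explicit hydrogens before graph construction."""
    mol = Chem.MolFromSmiles(smiles)  # sanitizes by default
    if mol is None:
        return None
    # Keep only the largest fragment (e.g. drop counter-ions in salts).
    frags = Chem.GetMolFrags(mol, asMols=True, sanitizeFrags=True)
    mol = max(frags, key=lambda m: m.GetNumHeavyAtoms())
    if add_hs:
        mol = Chem.AddHs(mol)
    return mol

# Example: aspirin as its sodium salt; the Na+ counter-ion is stripped.
mol = clean_smiles("CC(=O)Oc1ccccc1C(=O)[O-].[Na+]")
print(Chem.MolToSmiles(mol))
```

From a molecule cleaned this way, the featurizers described below build the node/edge tensors that go into each `torch_geometric.data.Data` object.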
Model — GNN Architecture
- The core `MolecularGCN` stacks `GCNConv` layers when no edge attributes are present and switches to `GINEConv` once we add curvature-augmented bonds, ensuring the message-passing MLP sees both node states and encoded edge channels.
- Edge attributes are first lifted by a two-layer encoder so the convolutions operate in the model's hidden width. Each layer is followed by batch norm, ReLU/GELU (configurable), and dropout to keep the hackathon training runs stable.
- After global pooling (mean/add/max), we concatenate optional `mol_features` context vectors before a three-layer MLP (`fc1`–`fc3`) that produces the gene-expression predictions. A lightweight `l2_regularization()` helper lets us decay weights without touching bias/normalization parameters.
Features — Geometric + Molecular
- Node features cover one-hot atom identity across permitted elements plus normalized counts for hydrogens, degree, valence, aromaticity, ring membership, radical electrons, and optional chirality tags pulled directly from RDKit atoms.
- Edge features include bond type one-hots, conjugation, ring flags, and stereo labels. Everything is duplicated for both directions so PyG operates on a simple undirected COO graph.
- Forman–Ricci curvature is computed per bond via the SCOTT/KILT curvature filtrations backend and appended as an extra edge channel. The formulation follows the curvature filtrations publication, giving the model geometric sensitivity without needing expensive 3D conformers.
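For intuition on the extra edge channel: in its simplest combinatorial form on an unweighted graph, Forman–Ricci curvature of an edge depends only on the endpoint degrees, F(u, v) = 4 − deg(u) − deg(v). The SCOTT/KILT backend we call implements the full curvature-filtration machinery; the sketch below is just the basic formula, with `forman_curvature` as an illustrative name:

```python
import networkx as nx

def forman_curvature(G: nx.Graph) -> dict:
    """Combinatorial Forman curvature per edge: F(u, v) = 4 - deg(u) - deg(v).
    Negative on hub-to-hub edges, higher on sparsely connected chains."""
    return {(u, v): 4 - G.degree(u) - G.degree(v) for u, v in G.edges()}

# Example: a benzene-like 6-cycle, where every bond gets 4 - 2 - 2 = 0.
ring = nx.cycle_graph(6)
print(forman_curvature(ring))
```

Because the value is computed from 2D graph topology alone, it can be appended as an edge feature with no conformer generation.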
Code and Reproducibility
- The full preprocessing + model stack lives in the `gixnn/` directory of this repository, including configs (`gixnn/config.py`), RDKit feature builders, curvature utilities, and the `MolecularGCN` implementation.
- Experiments are configured through `GNNConfig`, `TrainingConfig`, and `DataConfig`, so you can swap pooling strategies, activations, dropout, curvature toggles, or context feature widths without rewriting the model.
- Resources below include direct links to this codebase, the curvature-filtrations references, and background reading for anyone reproducing the hackathon runs.
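The config-driven pattern above might look like the following dataclass sketch. The field names here are illustrative assumptions, not the exact schema in `gixnn/config.py`:

```python
from dataclasses import dataclass, replace

@dataclass
class GNNConfig:
    hidden_dim: int = 128
    num_layers: int = 3
    pooling: str = "mean"        # "mean" | "add" | "max"
    activation: str = "relu"     # "relu" | "gelu"
    dropout: float = 0.1
    use_curvature: bool = True   # toggle the extra edge channel

@dataclass
class TrainingConfig:
    lr: float = 1e-3
    weight_decay: float = 1e-5
    epochs: int = 100

# Swapping a pooling strategy or disabling curvature is a one-field change,
# not a model rewrite.
cfg = replace(GNNConfig(), pooling="max", use_curvature=False)
print(cfg)
```

Keeping these knobs in small dataclasses made it easy to sweep variants during the hackathon without touching the model code.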