
Geometric Deep Learning for Drug–Target Interaction — Nucleate BioHack 2025

Built by Jeremy Wayland during the Nucleate BioHack (Novartis Challenge). This post outlines the dataset and paper we followed, our graph neural network approach, how we integrated geometric and molecular features, links to the code, and an interactive compound visualization.

Paper and Dataset — Context

  • We ingested the Novartis challenge SMILES library, sanitized the molecules with RDKit, stripped to the largest fragment, and optionally expanded with explicit hydrogens before graph construction.
  • Each graph stores a gene-expression vector of log fold changes in data.y plus optional per-sample metadata (dose, platform, etc.) in data.mol_features, allowing us to condition predictions on experiment context.
  • The preprocessing utilities live in gixnn/molecular_features.py and emit ready-to-batch torch_geometric.data.Data objects for training and evaluation.
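
The cleaning steps in the first bullet can be sketched with plain RDKit calls. This is a minimal illustration, not the code in gixnn/molecular_features.py: the function name prepare_molecule and its add_hydrogens flag are hypothetical, and choosing the largest fragment by heavy-atom count is an assumption about how "largest" is defined.

```python
from rdkit import Chem


def prepare_molecule(smiles: str, add_hydrogens: bool = False):
    """Parse and sanitize a SMILES string, keep the largest fragment,
    and optionally add explicit hydrogens.

    Returns None when the string cannot be parsed or holds no atoms.
    (Hypothetical helper; not part of the gixnn package.)
    """
    # MolFromSmiles sanitizes by default and returns None on failure.
    mol = Chem.MolFromSmiles(smiles)
    if mol is None or mol.GetNumAtoms() == 0:
        return None

    # Split salts and mixtures into fragments, then keep the fragment
    # with the most heavy atoms (assumed definition of "largest").
    fragments = Chem.GetMolFrags(mol, asMols=True)
    mol = max(fragments, key=lambda frag: frag.GetNumHeavyAtoms())

    if add_hydrogens:
        mol = Chem.AddHs(mol)
    return mol


# Sodium acetate: the sodium counter-ion is dropped and acetate is kept.
acetate = prepare_molecule("[Na+].CC(=O)[O-]")
acetate_with_h = prepare_molecule("[Na+].CC(=O)[O-]", add_hydrogens=True)
```

Returning None for unparseable input lets the caller decide whether to skip or log a bad library entry before any graph is built.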

Model — GNN Architecture

  • The core MolecularGCN stacks GCNConv layers when no edge attributes are present and switches to GINEConv once we add curvature-augmented bonds, so the message-passing MLP sees both node states and encoded edge channels.
  • Edge attributes are first lifted by a two-layer encoder so the convolutions operate in the model's hidden width. Each layer is followed by batch norm, ReLU/GELU (configurable), and dropout to keep the hackathon training runs stable.
  • After global pooling (mean/add/max), we concatenate optional mol_features context vectors before a three-layer MLP (fc1–fc3) that produces the gene-expression predictions. A lightweight l2_regularization() helper lets us decay weights without touching bias/normalization parameters.
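
The readout described in the last bullet can be written out in a few lines of NumPy to make the shapes visible. This is a framework-free sketch, not the MolecularGCN code: the function names global_pool, readout_input, and l2_penalty are hypothetical, and the rule "skip every 1-D parameter" is an assumed stand-in for however l2_regularization() identifies bias and normalization parameters.

```python
import numpy as np


def global_pool(node_states, batch, num_graphs, mode="mean"):
    """Reduce node states (num_nodes, hidden) to (num_graphs, hidden).

    `batch[i]` is the index of the graph that node i belongs to.
    """
    pooled = []
    for g in range(num_graphs):
        rows = node_states[batch == g]
        if mode == "mean":
            pooled.append(rows.mean(axis=0))
        elif mode == "add":
            pooled.append(rows.sum(axis=0))
        elif mode == "max":
            pooled.append(rows.max(axis=0))
        else:
            raise ValueError(f"unknown pooling mode: {mode}")
    return np.stack(pooled)


def readout_input(node_states, batch, num_graphs, mol_features=None, mode="mean"):
    """Pool each graph, then append its optional context vector."""
    pooled = global_pool(node_states, batch, num_graphs, mode)
    if mol_features is None:
        return pooled
    return np.concatenate([pooled, mol_features], axis=1)


def l2_penalty(named_params, weight_decay):
    """Weight decay term over matrices only.

    Assumption: 1-D arrays are biases or normalization scales and are skipped.
    """
    total = 0.0
    for value in named_params.values():
        if value.ndim > 1:
            total += float(np.sum(value ** 2))
    return weight_decay * total


# Two graphs (3 nodes and 2 nodes), hidden width 2, one context feature each.
node_states = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 1.0], [2.0, 3.0]])
batch = np.array([0, 0, 0, 1, 1])
mol_features = np.array([[0.5], [1.0]])
head_input = readout_input(node_states, batch, 2, mol_features, mode="mean")

params = {"fc1.weight": np.array([[1.0, 2.0], [3.0, 4.0]]), "fc1.bias": np.array([10.0, 10.0])}
penalty = l2_penalty(params, weight_decay=0.01)
```

The array head_input has one row per graph and hidden + context columns, which is the width the first MLP layer would need to accept.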

Features — Geometric + Molecular

  • Node features cover one-hot atom identity across permitted elements, plus normalized hydrogen count, degree, and valence, flags for aromaticity and ring membership, the radical-electron count, and optional chirality tags, all read directly from RDKit atoms.
  • Edge features include bond type one-hots, conjugation, ring flags, and stereo labels. Each bond is stored once per direction, with identical features on both copies, so PyG sees an undirected graph in COO format.
  • Forman–Ricci curvature is computed per bond via the SCOTT/KILT curvature filtrations backend and appended as an extra edge channel. The formulation follows the curvature filtrations publication, giving the model geometric sensitivity without needing expensive 3D conformers.

Interactive compound visualization. Molecule: Aspirin, SMILES: O=C(C)Oc1ccccc1C(=O)O
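
To make the curvature channel concrete, here is a pure-Python sketch on the aspirin SMILES above. It uses the combinatorial (triangle-augmented) Forman–Ricci formula for unweighted graphs, 4 - deg(u) - deg(v) + 3 * triangles(u, v); treating this as the variant computed by the curvature backend is an assumption, and the backend may use different weights or conventions. The bond list is written out by hand in SMILES atom order, and the helper names are hypothetical.

```python
from collections import defaultdict

# Heavy-atom bonds of aspirin, O=C(C)Oc1ccccc1C(=O)O, indexed in SMILES order.
ASPIRIN_BONDS = [
    (0, 1), (1, 2), (1, 3), (3, 4), (4, 5), (5, 6), (6, 7),
    (7, 8), (8, 9), (9, 4), (9, 10), (10, 11), (10, 12),
]


def forman_curvature(bonds):
    """Per-bond curvature: 4 - deg(u) - deg(v) + 3 * (triangles through the bond)."""
    neighbors = defaultdict(set)
    for u, v in bonds:
        neighbors[u].add(v)
        neighbors[v].add(u)

    curvature = {}
    for u, v in bonds:
        # A triangle through (u, v) is any atom bonded to both endpoints.
        triangles = len(neighbors[u] & neighbors[v])
        curvature[(u, v)] = 4 - len(neighbors[u]) - len(neighbors[v]) + 3 * triangles
    return curvature


def to_directed(bonds, curvature):
    """Store each bond in both directions with curvature as a one-column edge feature."""
    edge_index = [[], []]  # COO layout: row 0 holds sources, row 1 holds targets
    edge_attr = []
    for u, v in bonds:
        for src, dst in ((u, v), (v, u)):
            edge_index[0].append(src)
            edge_index[1].append(dst)
            edge_attr.append([float(curvature[(u, v)])])
    return edge_index, edge_attr


curvature = forman_curvature(ASPIRIN_BONDS)
edge_index, edge_attr = to_directed(ASPIRIN_BONDS, curvature)
```

Under this formula, bonds between two branching atoms (such as the ring carbon and the carboxyl carbon) get the most negative values, while terminal bonds sit near zero. In a full pipeline this column would be concatenated with the bond-type, conjugation, ring, and stereo features.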

Code and Reproducibility

  • The full preprocessing + model stack lives in the gixnn/ directory of this repository, including configs (gixnn/config.py), RDKit feature builders, curvature utilities, and the MolecularGCN implementation.
  • Experiments are configured through GNNConfig, TrainingConfig, and DataConfig, so you can swap pooling strategies, activations, dropout, curvature toggles, or context feature widths without rewriting the model.
  • Resources below include direct links to this codebase, the curvature-filtrations references, and background reading for anyone reproducing the hackathon runs.
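
The config-driven setup can be pictured as a small dataclass. The class name GNNConfig comes from the post, but every field name, default, and the conv_type property below are hypothetical; the real definitions are in gixnn/config.py and may look quite different.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GNNConfig:
    # Hypothetical fields and defaults, chosen only to mirror the knobs
    # named in the post; they are not copied from gixnn/config.py.
    hidden_dim: int = 128
    num_layers: int = 3
    pooling: str = "mean"        # "mean", "add", or "max"
    activation: str = "relu"     # "relu" or "gelu"
    dropout: float = 0.1
    use_curvature: bool = False  # adds the curvature edge channel
    context_dim: int = 0         # width of the optional mol_features vector

    def __post_init__(self):
        if self.pooling not in {"mean", "add", "max"}:
            raise ValueError(f"unknown pooling: {self.pooling}")
        if self.activation not in {"relu", "gelu"}:
            raise ValueError(f"unknown activation: {self.activation}")
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError("dropout must be in [0, 1)")

    @property
    def conv_type(self) -> str:
        # Mirrors the post: GCNConv without edge attributes, GINEConv with them.
        return "GINEConv" if self.use_curvature else "GCNConv"


base = GNNConfig()
# Derive a variant without mutating the baseline; validation runs again.
curved = replace(base, use_curvature=True, pooling="max")
```

Keeping the config immutable and deriving variants with replace makes it easy to log exactly which settings each hackathon run used.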

Resources

Program details and challenge tracks, including the Novartis challenge.

Description and access to raw CIGS Data Files used for the Novartis Challenge.

Background on message passing, equivariance, and molecular representations.

Relevant citation (Southern & Wayland) detailing Ricci curvature and its application to graph generative model evaluation.

KILT backend for computing curvature features on molecular graphs.

This repo’s gixnn/ directory with configs, feature builders, curvature utilities, and the MolecularGCN implementation.