COMPOSE: Composing Future Theorems from Citations and Formal Structure

COMPOSE: Composing Future Theorems
from Citations and Formal Structure

David Busbib, Michael Werman

Hebrew University of Jerusalem

arXiv Code

Checkpoint

Mathlib Corpus

Overview of COMPOSE. (Top) Data construction: given an anchor paper and its citation graph, we build a scientific graph of paper and theorem nodes, then align informal theorems to Mathlib4 to construct a formal dependency graph. (Bottom) Model architecture: the scientific and formal graphs are encoded by dedicated GNNs, fused via cross-attention (⊕), and used to condition a DeepSeek-Math-7B decoder that generates a predicted future research claim, which is then used to retrieve the actual future paper from a 47K pool.

Abstract

Anticipating future mathematical claims requires more than extrapolating from prior text: progress is shaped both by evolving scientific ideas and by the formal dependency structures that constrain valid results. Existing approaches capture only one of these signals.

We introduce grounded future mathematical generation — the task of predicting plausible future claims given an anchor paper, by jointly encoding its citation graph and its formal Mathlib theorem structure. We build a dataset of 108K scientific–formal graph pairs from arXiv and Mathlib4 (up to 2023), and a temporally grounded retrieval benchmark of 47K future papers from 2024–2025.

We propose COMPOSE, a dual-graph encoder that fuses citation context and formal theorem structure to condition a DeepSeek-Math-7B decoder. COMPOSE achieves the best retrieval performance, recovering the correct future paper in the top 10 in 50.8% of cases on the confidence-stratified evaluation subset.

Grounded Future Mathematical Generation

Standard future-paper prediction treats the problem as language modelling over text. COMPOSE reframes it: the model is given the citation neighbourhood of an anchor paper plus the Mathlib theorems closest to its topic, and must generate a theorem-level claim that will be proved in a paper published after the training cutoff.

Each training example is a triple (G_s, G_f, y*): the scientific citation subgraph, the Mathlib theorem dependency subgraph, and the abstract of the future paper used as retrieval target. At inference, the generated claim is embedded and used to retrieve from the 47K future-paper pool.

Citation Graph G_s

Up to 30 cited papers, each embedded with E5-large-v2 (1024-dim). A SimpleGNN aggregates over citation edges to produce 1152-dim node representations.

Formal Graph G_f

Top-K Mathlib theorems by cosine similarity to the anchor abstract, expanded via BFS through proof dependencies. Node features are DeepSeek-Math embeddings of Lean theorem statements.

Retrieval Evaluation

Generated claims are embedded with a fine-tuned DeepSeek-Math encoder and retrieved from a 47K pool of papers published in 2024–2025. Primary metric: H@10.

Dataset

We collect 108K paper–Mathlib graph pairs from arXiv (2000–2023). For each paper, theorem statements are matched to Mathlib theorems in two stages: embedding-based retrieval followed by formal-syntax filtering. Papers with at least one high-confidence match use both encoders; the remaining papers are trained with the citation encoder only.

Evaluation follows a strict temporal split. All test papers are from 2024–2025, unseen during training. We keep only anchor papers cited by at least 7 future papers in the pool, giving roughly 2K test queries against a 47K retrieval pool.

108K

scientific–formal graph training pairs

47K

future papers in the retrieval pool

~2K

temporal test queries

180K

Mathlib theorems in the formal corpus

COMPOSE Architecture

Both graph encoders use the same SimpleGNN backbone but differ in node initialization. Enc1 uses E5-large-v2 abstract embeddings; enc2 uses DeepSeek-Math embeddings of Lean theorem statements. After two GNN hops, a Bridge MLP projects node features into a shared 4224-dim latent space.

A single bidirectional cross-attention block lets the scientific graph attend to the formal graph and vice versa. The resulting fused representations are injected as cross-attention keys/values into eight evenly spaced layers of DeepSeek-Math-7B. Only those cross-attention weights are trained — 10.7% of total parameters.

Stage 1 — Graph Pretraining

Train enc1 and enc2 jointly with link prediction (L_link), cross-domain alignment (L_align), and informal↔formal contrastive loss (L_cross). The decoder is frozen throughout.

Stage 2 — Generation Fine-tuning

Cross-attention weights in the decoder are unfrozen. Training combines L_CE (autoregressive generation) with L_margin, which penalises the model for generating the same output regardless of the graph context.

Results

We evaluate on the full test set (47K pool). COMPOSE achieves H@10 = 0.508, compared to 0.410 for CoI-GPT4 and 0.080 for GIANTS. The Gap column measures cosine similarity to the target minus cosine to 500 random negatives.

Retrieval on the full test set (47K pool)

Model	H@10	H@100	Gap
COMPOSE (ours)	0.508	0.808	0.240
CoI-GPT4	0.410	0.770	0.176
Text-only (LoRA)	0.369	0.738	0.177
GoAI (DeepSeek-Math 7B)	0.376	0.680	0.202
Prompt-only (DeepSeek-Math 7B)	0.348	0.697	0.211
GIANTS	0.080	0.329	0.207
Fixed NN (abstracts + main theorem)	0.068	0.392	0.108

H@k retrieval curves (k = 1 to 100) on the confidence-stratified subset. COMPOSE (blue squares) leads at k = 5, 10, and 20.

Ablation Study (confidence-stratified 200-sample subset, 47K pool)

Model	H@10	H@100	Gap
Full graph (ours)	0.750	0.845	0.201
Formal-graph only (no G_s)	0.510	0.810	0.141
w/o stage-1 pretraining	0.240	0.505	0.093
w/o bidirectional fusion	0.195	0.530	0.090
Paper-graph only (no G_f)	0.075	0.260	0.043

Removing G_f drops H@10 from 0.750 to 0.075 — the largest single drop — confirming that Mathlib formal structure is the primary driver. Removing bidirectional fusion drops it to 0.195, showing that cross-encoder interaction is essential.

Example Predictions

We show COMPOSE side-by-side with baselines on two representative examples from the test set. In each case the model receives only the anchor paper title as text; all mathematical context is injected via the citation and Mathlib graphs. Similarity scores are cosine similarity to the ground-truth future paper embedding.

Example 1 — Distributed Nash Equilibrium (arXiv:1212.6235 — game theory / optimization)

Ground-truth future paper

"Noncooperative game-theoretic tools have been increasingly used to study many important resource allocation problems in communications, networking, smart grids, and portfolio optimization. In this paper, we consider a general class of convex Nash Equilibrium Problems (NEPs), where each player aims to solve an arbitrary smooth convex optimization problem."

Model	Generated Output	Sim	H@10
COMPOSE	We propose an algorithm that solves monotone game problems in which players have different local objectives but share some common affine inequalities as well as their individual ones (e.g., capacity limits). Convergence properties under mild assumptions using variational analysis combined with Lyapunov arguments…	0.910	✓ rank 2
Text-only (LoRA)	We prove that the game admits a unique Nash equilibrium, and that this equilibrium is the solution of a certain convex optimization problem… [highest sim, rank 26 — fails]	0.964	✗
GoAI	In an N-coalition non-cooperative game, if the players in the same coalition cooperatively minimize the sum of their individual costs…	0.910	✓ rank 9
Prompt-only	Based on the abstract and related papers, the main theorem could be a result about the convergence of a gradient-based algorithm for distributed Nash equilibrium computation…	0.805	✓

Text-only (LoRA) achieves the highest embedding similarity (0.964) with a concise Nash equilibrium statement, yet fails at rank 26. COMPOSE (0.910, rank 2) generates richer context about monotone games and variational analysis — the broader framework the ground truth actually belongs to — enabling successful retrieval. A decontextualised correct statement is not enough without graph grounding.

Example 2 — Koopman Operator Convergence (arXiv:2206.13659 — dynamical systems)

Ground-truth future paper

"We develop an algebraic framework for sequential data assimilation of partially observed dynamical systems. In this framework, Bayesian data assimilation is embedded in a non-abelian operator algebra… the forecast step of data assimilation is represented by a quantum operation induced by the Koopman operator of the dynamical system."

Model	Generated Output	Sim	H@10
COMPOSE	The authors prove a convergence theorem for extended dynamic mode decomposition applied with finite-rank approximations and show how this relates to approximating eigenfunctions and eigenvalues associated with dynamical systems governed by ergodic measures, using empirical data samples only…	0.934	✓
Text-only (LoRA)	We prove that the eigenvalues of the matrix K_N^M approximate the eigenvalues of K^M_N when M → ∞. [higher sim, rank 34 — fails]	0.945	✗
Prompt-only	Based on the provided abstract and related papers, it seems that the main focus is on using Koopman operator theory to classify and compare iterative algorithms…	0.708	✗
GoAI	Let K be the Koopman operator acting on the space of observables L²(μ), where μ is an ergodic measure. Then, the eigenvalues of K are contained in the unit circle…	0.402	✗

COMPOSE (sim 0.934, rank 1) generates a precise description of EDMD convergence with eigenfunctions and ergodic measures — directly matching the ground truth's Koopman operator framework. Notably, Text-only (LoRA) achieves higher embedding similarity (0.945) with a narrow eigenvalue statement, yet fails to retrieve (rank 34) — showing that a correct but decontextualised statement is insufficient without graph conditioning.

Limitations

Our alignment pipeline finds high-confidence Mathlib matches for only 16.9% of paper theorems; the remaining anchor papers run with the citation encoder only. Evaluation is bounded by the 47K pool, so novel work outside that window is invisible to our retrieval metric.

All experiments are on mathematics. Whether the approach transfers to other domains with formal libraries (Coq, Isabelle) remains open. We also do not verify formal provability of generated theorems — connecting generation with automated verification is a natural next step.

BibTeX

@article{busbib2026compose,
  title   = {{COMPOSE}: Composing Future Theorems from Citations and Formal Structure},
  author  = {Busbib, David and Werman, Michael},
  journal = {arXiv preprint},
  year    = {2026}
}