Abstract
Training a single general-purpose large reasoning model (LRM) across diverse domains remains challenging due to pronounced domain heterogeneity. We introduce Modular Gradient Surgery (MGS), which resolves gradient conflicts at the module level within the transformer.
Effectiveness Comparison (average improvement over standard multi-task RL): MGS achieves stronger and more balanced performance than Sequential RL, Naïve Mixing, and Global Gradient Surgery.
Strategy Analysis 1: Sequential RL
Training domains sequentially triggers Mode Interference, creating a performance ceiling through two primary failure modes.
Forgetting
Training Math after Chat triggers an asymmetric 7.0-point drop in conversational capability.
Rigidity
Low-entropy Math updates "lock" exploratory behavior, opening a 9.6-point gap.
Entropy & Capability Trade-off
Sequential trajectories fail to reach expert performance levels due to early exploratory collapse (Math) or subsequent capability erasure (Chat).
Strategy Analysis 2: Mixed RL
Mixing data in batches causes interference at the Gradient Level, where updates for one task directly harm the other.
Negative Cosine Similarity Detected
Math updates frequently exhibit negative alignment with Chat updates, creating optimization friction that persists throughout the training trajectory.
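The conflict signal described above can be measured directly. A minimal sketch (toy gradient values, hypothetical function name) that computes the cosine similarity between two tasks' flattened gradient vectors; a negative value indicates the updates pull the shared parameters in opposing directions:

```python
import numpy as np

def grad_cosine(g_math, g_chat):
    """Cosine similarity between two flattened gradient vectors."""
    num = np.dot(g_math, g_chat)
    denom = np.linalg.norm(g_math) * np.linalg.norm(g_chat)
    return num / denom

# Toy conflicting gradients: negative alignment signals interference
g_math = np.array([1.0, -2.0, 0.5])
g_chat = np.array([-1.0, 1.0, 0.2])
print(grad_cosine(g_math, g_chat))  # negative → conflict
```

In practice the same check would run on gradients flattened from the model's parameters after each task's backward pass.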
The Mixing Paradox
Empirical results show that even with a highly skewed mixture (90% Math, 10% Chat), the model consistently lags behind a specialized Math-only expert.
Conclusion: Adjusting data proportions alone cannot eliminate cross-domain interference; the conflict is fundamental to the shared gradient space.
Global Constraints: Standard "global" surgery (PCGrad) is often overly conservative. Because it treats the model as a monolithic block, a conflict driven by a single module triggers projection of the entire gradient, stalling learning across the whole system.
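To make this concrete, here is a minimal sketch of the global variant (a toy illustration, not the paper's implementation): both the conflict test and the projection operate on the full flattened model gradient, so a clash localized to one module rewrites the update for every parameter.

```python
import numpy as np

def global_pcgrad(g_a, g_b):
    """Global PCGrad: g_a and g_b are FULL flattened model gradients.
    One dot product over all parameters decides whether the entire
    update g_a gets projected onto the plane orthogonal to g_b."""
    if np.dot(g_a, g_b) < 0:  # conflict anywhere triggers global projection
        g_a = g_a - np.dot(g_a, g_b) / np.dot(g_b, g_b) * g_b
    return g_a

# Toy conflicting gradients: after projection, g_a ⟂ g_b
print(global_pcgrad(np.array([1.0, -2.0]), np.array([-1.0, 1.0])))
```

The projected gradient is orthogonal to the conflicting one, which removes the interference but also discards any useful component that happened to lie along it.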
Modular
Gradient Surgery
Instead of treating the model as a monolithic block, MGS partitions parameters into functional modules (MLPs, Attention) and applies Gradient Surgery (PCGrad) independently.
- Local Resolution: Prevents one module's conflict from stalling the whole model.
- Functional Specialization: Preserves domain-specific knowledge in MLPs.
- FSDP Compatible: Overhead is negligible (~0.1% training time).
from itertools import permutations
import numpy as np

def apply_mgs(gradients, modules):
    # Resolve conflicts independently per module (PCGrad, applied locally)
    for task_a, task_b in permutations(gradients, 2):
        for mod in modules:  # iterate per module
            g_a = gradients[task_a][mod]
            g_b = gradients[task_b][mod]
            if np.dot(g_a, g_b) < 0:  # conflict in this module only
                # project g_a onto the plane orthogonal to g_b (in place)
                g_a -= np.dot(g_a, g_b) / np.dot(g_b, g_b) * g_b
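Applied to toy per-module gradients (module names and values hypothetical), the modular rule alters only the module where the conflict occurs, leaving aligned modules untouched:

```python
import numpy as np

# Toy gradients for two tasks over two modules
grads = {
    "math": {"mlp": np.array([1.0, -2.0]), "attn": np.array([1.0, 0.0])},
    "chat": {"mlp": np.array([-1.0, 1.0]), "attn": np.array([1.0, 1.0])},
}

for mod in ("mlp", "attn"):
    g_a, g_b = grads["math"][mod], grads["chat"][mod]
    if np.dot(g_a, g_b) < 0:  # only the MLP pair conflicts here
        g_a -= np.dot(g_a, g_b) / np.dot(g_b, g_b) * g_b

print(grads["math"]["mlp"])   # projected: now orthogonal to chat's MLP grad
print(grads["math"]["attn"])  # unchanged: [1. 0.]
```

A global scheme would instead have projected the concatenation of both modules, perturbing the attention update even though it did not conflict.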
Main Evaluation Results
Comparison of post-training strategies across diverse benchmarks.
| | | Math (pass@1, avg@4) | | | | Chat | | | | Misc. | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Backbone | Method | MATH | GSM | AIME | M. Avg. | WB | AE2 | CW3 | C. Avg. | IF | Gen. | Total Avg. |
| Prior Works | | | | | | | | | | | | |
| Llama | RLMT | 4.9 | 10.2 | 0.0 | 5.0 | 35.6 | 15.6 | 36.3 | 29.2 | 20.4 | 21.5 | 18.9 |
| Qwen | RLMT | 71.8 | 82.7 | 10.0 | 54.8 | 44.7 | 57.2 | 41.7 | 47.9 | 25.9 | 35.4 | 42.4 |
| | General Reasoner | 69.4 | 81.5 | 16.7 | 55.9 | 13.3 | 38.8 | 39.2 | 30.4 | 25.1 | 36.7 | 38.1 |
| Qwen-Inst | OpenThinker v3 | 78.8 | 87.7 | 29.2 | 65.2 | -50.2 | 19.3 | 29.7 | -0.4 | 24.4 | 39.1 | 32.8 |
| Multi-task Baselines | | | | | | | | | | | | |
| Qwen | Normalized Adv. | 76.1 | 91.2 | 9.2 | 58.8 | 43.8 | 44.8 | 39.0 | 42.5 | 27.3 | 34.2 | 41.9 |
| | Model Merging | 74.8 | 90.6 | 11.7 | 59.6 | 33.0 | 12.4 | -24.8 | 6.9 | 24.8 | 32.9 | 31.4 |
| This Work | | | | | | | | | | | | |
| Llama | Math → Chat | 24.6 | 71.6 | 0.0 | 32.1 | 25.7 | 29.0 | 38.3 | 31.0 | 21.4 | 25.0 | 27.9 |
| | Chat → Math | 27.6 | 78.8 | 1.7 | 36.0 | -35.0 | 6.5 | 28.3 | -0.1 | 21.3 | 28.8 | 21.5 |
| | Naïve Mixing | 25.8 | 76.6 | 0.0 | 34.1 | 15.0 | 21.5 | 33.8 | 23.4 | 20.7 | 23.7 | 25.9 |
| | Global Surgery | 29.8 | 80.4 | 1.7 | 37.3 | 2.3 | 48.3 | 34.9 | 28.5 | 22.9 | 29.4 | 30.1 |
| | Modular Surgery | 26.0 | 73.6 | 0.8 | 33.5 | 24.7 | 39.6 | 34.1 | 32.8 | 23.0 | 29.3 | 30.2 |
| Qwen | Math → Chat | 75.9 | 90.8 | 13.3 | 60.0 | 35.7 | 35.4 | 43.9 | 38.3 | 24.8 | 30.8 | 39.7 |
| | Chat → Math | 78.2 | 92.2 | 10.8 | 60.4 | 36.0 | 46.8 | 40.1 | 41.0 | 24.4 | 36.5 | 42.0 |
| | Naïve Mixing | 76.0 | 89.8 | 10.8 | 58.9 | 47.1 | 26.2 | 40.7 | 38.0 | 22.1 | 36.5 | 40.4 |
| | Global Surgery | 73.0 | 88.8 | 14.2 | 58.7 | 45.4 | 62.5 | 40.2 | 49.4 | 25.6 | 36.8 | 44.2 |
| | Modular Surgery | 75.1 | 90.6 | 14.2 | 60.0 | 48.3 | 59.3 | 44.2 | 50.6 | 28.2 | 35.3 | 44.9 |
SOTA Multi-task Performance
MGS consistently achieves the highest Total Average across both Llama and Qwen families, outperforming strong baselines like Global Surgery and Normalized Advantage.
Harmonized Task Balance
While sequential methods trade off one domain for another, MGS maintains high performance in Math while reaching state-of-the-art Chat and IF scores.
Extended Results
MGS scales with more tasks and training steps.
Llama-3.1-8B Results on Mixing Three Tasks
| Method | Math | Chat | IF | Gen | Total Avg |
|---|---|---|---|---|---|
| Naïve Mixing | 35.0 | 22.9 | 25.4 | 25.2 | 27.3 |
| Global Surgery | 35.8 | 24.2 | 31.1 | 30.3 | 30.3 |
| Mod. Surgery | 36.0 | 30.5 | 30.0 | 33.1 | 32.6 |
MGS prevents capability decay, maintaining a linear growth curve during extended RL.
Citation
@misc{cai2026advancinggeneralpurposereasoningmodels,
title={Advancing General-Purpose Reasoning Models with Modular Gradient Surgery},
author={Min Cai and Yu Liang and Longzheng Wang and Yan Wang and Yueyang Zhang and Long Xia and Zhiyuan Sun and Xi Ye and Daiting Shi},
year={2026},
eprint={2602.02301},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2602.02301},
}