Abstract
Training a single general-purpose large reasoning model (LRM) across diverse domains remains challenging due to pronounced domain heterogeneity. We introduce Modular Gradient Surgery (MGS), which resolves gradient conflicts at the module level within the transformer.
Effectiveness Comparison (average improvement over standard multi-task RL): MGS achieves stronger and more balanced performance than Sequential RL, Naïve Mixing, and Global Gradient Surgery.
Strategy Analysis 1: Sequential RL
Training domains sequentially triggers Mode Interference, creating a performance ceiling through two primary failure modes.
Forgetting
Training Math after Chat triggers an asymmetric 7.0-point drop in conversational capability.
Rigidity
Low-entropy Math updates "lock" exploratory behavior, opening a 9.6-point gap.
Entropy & Capability Trade-off
Sequential trajectories fail to reach expert performance levels due to early exploratory collapse (Math) or subsequent capability erasure (Chat).
Strategy Analysis 2: Mixed RL
Mixing data in batches causes interference at the Gradient Level, where updates for one task directly harm the other.
Negative Cosine Similarity Detected
Math updates frequently exhibit negative alignment with Chat updates, creating optimization friction that persists throughout the training trajectory.
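The conflict signal described above can be measured directly. A minimal sketch (toy gradient values, hypothetical function name) that computes the cosine similarity between two tasks' flattened gradient vectors; a negative value indicates the updates pull the shared parameters in opposing directions:

```python
import numpy as np

def grad_cosine(g_math, g_chat):
    """Cosine similarity between two flattened gradient vectors."""
    num = np.dot(g_math, g_chat)
    denom = np.linalg.norm(g_math) * np.linalg.norm(g_chat)
    return num / denom

# Toy conflicting gradients: negative alignment signals interference
g_math = np.array([1.0, -2.0, 0.5])
g_chat = np.array([-1.0, 1.0, 0.2])
print(grad_cosine(g_math, g_chat))  # negative → conflict
```

In practice the same check would run on gradients flattened from the model's parameters after each task's backward pass.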
The Mixing Paradox
Empirical results show that even with a highly skewed mixture (90% Math, 10% Chat), the model consistently lags behind a specialized Math-only expert.
Conclusion: Adjusting data proportions alone cannot eliminate cross-domain interference; the conflict is fundamental to the shared gradient space.
Global Constraints: Standard "global" surgery (PCGrad) is often overly conservative. Because it treats the model as a monolithic block, a conflict driven by a single module triggers projection of the entire gradient, stalling learning across the whole system.
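To make this concrete, here is a minimal sketch of the global variant (a toy illustration, not the paper's implementation): both the conflict test and the projection operate on the full flattened model gradient, so a clash localized to one module rewrites the update for every parameter.

```python
import numpy as np

def global_pcgrad(g_a, g_b):
    """Global PCGrad: g_a and g_b are FULL flattened model gradients.
    One dot product over all parameters decides whether the entire
    update g_a gets projected onto the plane orthogonal to g_b."""
    if np.dot(g_a, g_b) < 0:  # conflict anywhere triggers global projection
        g_a = g_a - np.dot(g_a, g_b) / np.dot(g_b, g_b) * g_b
    return g_a

# Toy conflicting gradients: after projection, g_a ⟂ g_b
print(global_pcgrad(np.array([1.0, -2.0]), np.array([-1.0, 1.0])))
```

The projected gradient is orthogonal to the conflicting one, which removes the interference but also discards any useful component that happened to lie along it.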
Modular
Gradient Surgery
Instead of treating the model as a monolithic block, MGS partitions parameters into functional modules (MLPs, Attention) and applies Gradient Surgery (PCGrad) independently.
- Local Resolution: Prevents one module's conflict from stalling the whole model.
- Functional Specialization: Preserves domain-specific knowledge in MLPs.
- FSDP Compatible: Overhead is negligible (~0.1% training time).
from itertools import permutations
import numpy as np

def apply_mgs(gradients, modules):
    # Resolve conflicts independently per module (PCGrad, applied locally)
    for task_a, task_b in permutations(gradients, 2):
        for mod in modules:  # iterate per module
            g_a = gradients[task_a][mod]
            g_b = gradients[task_b][mod]
            if np.dot(g_a, g_b) < 0:  # conflict in this module only
                # project g_a onto the plane orthogonal to g_b (in place)
                g_a -= np.dot(g_a, g_b) / np.dot(g_b, g_b) * g_b
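Applied to toy per-module gradients (module names and values hypothetical), the modular rule alters only the module where the conflict occurs, leaving aligned modules untouched:

```python
import numpy as np

# Toy gradients for two tasks over two modules
grads = {
    "math": {"mlp": np.array([1.0, -2.0]), "attn": np.array([1.0, 0.0])},
    "chat": {"mlp": np.array([-1.0, 1.0]), "attn": np.array([1.0, 1.0])},
}

for mod in ("mlp", "attn"):
    g_a, g_b = grads["math"][mod], grads["chat"][mod]
    if np.dot(g_a, g_b) < 0:  # only the MLP pair conflicts here
        g_a -= np.dot(g_a, g_b) / np.dot(g_b, g_b) * g_b

print(grads["math"]["mlp"])   # projected: now orthogonal to chat's MLP grad
print(grads["math"]["attn"])  # unchanged: [1. 0.]
```

A global scheme would instead have projected the concatenation of both modules, perturbing the attention update even though it did not conflict.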
Main Evaluation Results
Comparison of post-training strategies across diverse benchmarks.
| | | Math (pass@1, avg@4) | | | | Chat | | | | Misc. | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Backbone | Method | MATH | GSM | AIME | M. Avg. | WB | AE2 | CW3 | C. Avg. | IF | Gen. | Total Avg. |
| Prior Works | | | | | | | | | | | | |
| Llama | RLMT | 4.9 | 10.2 | 0.0 | 5.0 | 35.6 | 15.6 | 36.3 | 29.2 | 20.4 | 21.5 | 18.9 |
| Qwen | RLMT | 71.8 | 82.7 | 10.0 | 54.8 | 44.7 | 57.2 | 41.7 | 47.9 | 25.9 | 35.4 | 42.4 |
| | General Reasoner | 69.4 | 81.5 | 16.7 | 55.9 | 13.3 | 38.8 | 39.2 | 30.4 | 25.1 | 36.7 | 38.1 |
| Qwen-Inst | OpenThinker v3 | 78.8 | 87.7 | 29.2 | 65.2 | -50.2 | 19.3 | 29.7 | -0.4 | 24.4 | 39.1 | 32.8 |
| Multi-task Baselines | | | | | | | | | | | | |
| Qwen | Normalized Adv. | 76.1 | 91.2 | 9.2 | 58.8 | 43.8 | 44.8 | 39.0 | 42.5 | 27.3 | 34.2 | 41.9 |
| | Model Merging | 74.8 | 90.6 | 11.7 | 59.6 | 33.0 | 12.4 | -24.8 | 6.9 | 24.8 | 32.9 | 31.4 |
| This Work | | | | | | | | | | | | |
| Llama | Math → Chat | 24.6 | 71.6 | 0.0 | 32.1 | 25.7 | 29.0 | 38.3 | 31.0 | 21.4 | 25.0 | 27.9 |
| | Chat → Math | 27.6 | 78.8 | 1.7 | 36.0 | -35.0 | 6.5 | 28.3 | -0.1 | 21.3 | 28.8 | 21.5 |
| | Naïve Mixing | 25.8 | 76.6 | 0.0 | 34.1 | 15.0 | 21.5 | 33.8 | 23.4 | 20.7 | 23.7 | 25.9 |
| | Global Surgery | 29.8 | 80.4 | 1.7 | 37.3 | 2.3 | 48.3 | 34.9 | 28.5 | 22.9 | 29.4 | 30.1 |
| | Modular Surgery | 26.0 | 73.6 | 0.8 | 33.5 | 24.7 | 39.6 | 34.1 | 32.8 | 23.0 | 29.3 | 30.2 |
| Qwen | Math → Chat | 75.9 | 90.8 | 13.3 | 60.0 | 35.7 | 35.4 | 43.9 | 38.3 | 24.8 | 30.8 | 39.7 |
| | Chat → Math | 78.2 | 92.2 | 10.8 | 60.4 | 36.0 | 46.8 | 40.1 | 41.0 | 24.4 | 36.5 | 42.0 |
| | Naïve Mixing | 76.0 | 89.8 | 10.8 | 58.9 | 47.1 | 26.2 | 40.7 | 38.0 | 22.1 | 36.5 | 40.4 |
| | Global Surgery | 73.0 | 88.8 | 14.2 | 58.7 | 45.4 | 62.5 | 40.2 | 49.4 | 25.6 | 36.8 | 44.2 |
| | Modular Surgery | 75.1 | 90.6 | 14.2 | 60.0 | 48.3 | 59.3 | 44.2 | 50.6 | 28.2 | 35.3 | 44.9 |
SOTA Multi-task Performance
MGS consistently achieves the highest Total Average across both Llama and Qwen families, outperforming strong baselines like Global Surgery and Normalized Advantage.
Harmonized Task Balance
While sequential methods trade off one domain for another, MGS maintains high performance in Math while reaching state-of-the-art Chat and IF scores.
Extended Results
MGS scales with more tasks and training steps.
Llama-3.1-8B Results on Mixing Three Tasks
| Method | Math | Chat | IF | Gen | Total Avg |
|---|---|---|---|---|---|
| Naïve Mixing | 35.0 | 22.9 | 25.4 | 25.2 | 27.3 |
| Global Surgery | 35.8 | 24.2 | 31.1 | 30.3 | 30.3 |
| Mod. Surgery | 36.0 | 30.5 | 30.0 | 33.1 | 32.6 |
MGS prevents capability decay, maintaining a linear growth curve during extended RL.
Citation
@misc{cai2026advancinggeneralpurposereasoningmodels,
title={Advancing General-Purpose Reasoning Models with Modular Gradient Surgery},
author={Min Cai and Yu Liang and Longzheng Wang and Yan Wang and Yueyang Zhang and Long Xia and Zhiyuan Sun and Xi Ye and Daiting Shi},
year={2026},
eprint={2602.02301},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2602.02301},
}