
Advancing General-Purpose Reasoning Models with
Modular Gradient Surgery

Min Cai1,2* Yu Liang1 Longzheng Wang1 Yan Wang1 Yueyang Zhang1 Long Xia1 Zhiyuan Sun1 Xi Ye2 Daiting Shi1
1 Baidu Inc.
2 University of Alberta
* Work done during CM's internship at Baidu

Abstract

"Training a single general-purpose LRM across diverse domains remains challenging due to pronounced domain heterogeneity. We introduce Modular Gradient Surgery (MGS), which resolves gradient conflicts at the module level within the transformer."

+4.5 pts average improvement over standard Multi-task RL.

MGS Performance Frontier

Effectiveness Comparison: MGS achieves stronger and more balanced performance than Sequential RL, Naïve Mixing, and Global Gradient Surgery.

Strategy Analysis 1: Sequential RL

Training on domains sequentially triggers Mode Interference, creating a performance ceiling through two primary failure modes.

  • Forgetting: Math-after-Chat triggers an asymmetric -7.0 drop in conversational capability.
  • Rigidity: Low-entropy Math updates "lock" exploratory behavior, creating a -9.6 gap.

Entropy & Capability Trade-off

Sequential trajectories fail to reach expert performance levels due to early exploratory collapse (Math) or subsequent capability erasure (Chat).
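To make the entropy axis concrete, here is a minimal sketch of how per-token policy entropy could be tracked during RL rollouts; the logits tensor shape and the simple averaging scheme are assumptions, not the paper's exact metric.

import torch

def mean_token_entropy(logits: torch.Tensor) -> float:
    # Hypothetical monitor: logits has shape (batch, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # H = -sum_v p(v) log p(v), averaged over all generated tokens
    entropy = -(probs * log_probs).sum(dim=-1)
    return entropy.mean().item()

A collapsing value of this quantity over training steps is what "exploratory collapse" refers to above.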

Strategy Analysis 2: Mixed RL

Mixing data in batches causes interference at the Gradient Level, where updates for one task directly harm the other.

Conflict Monitor: gradient cosine similarity between task updates (log-scale analysis)

Negative Cosine Similarity Detected

Math updates frequently exhibit negative alignment with Chat updates, creating optimization friction that persists throughout the training trajectory.
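As a rough illustration of such a conflict monitor, the per-task gradients can be flattened into single vectors and compared with cosine similarity; the dict-of-tensors input format here is an assumption made for this sketch.

import torch
import torch.nn.functional as F

def gradient_cosine(grads_math: dict, grads_chat: dict) -> float:
    # Flatten each task's per-parameter gradients into one long vector
    g_math = torch.cat([g.flatten() for g in grads_math.values()])
    g_chat = torch.cat([g.flatten() for g in grads_chat.values()])
    # A negative value means the two tasks' updates point in conflicting directions
    return F.cosine_similarity(g_math, g_chat, dim=0).item()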

Pareto Frontier Analysis

Key Insight: The Mixing Paradox

Empirical results show that even with a highly skewed mixture (90% Math, 10% Chat), the model consistently lags behind a specialized Math-only expert.

Conclusion: Simply adjusting data proportions is insufficient to eliminate cross-domain interference; the conflict is fundamental to the shared gradient space.
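For concreteness, a skewed mixture such as the 90/10 split above could be drawn with a simple weighted sampler; this is an illustrative sketch, not the paper's data pipeline.

import random

def sample_mixed_batch(math_pool, chat_pool, batch_size=256, math_ratio=0.9):
    # Each example's domain is drawn independently with the given mixing ratio
    batch = []
    for _ in range(batch_size):
        pool = math_pool if random.random() < math_ratio else chat_pool
        batch.append(random.choice(pool))
    return batch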

Global Constraints

Standard "global" surgery (PCGrad) is often overly conservative: because it treats the model as a monolithic block, a conflict detected in a single module triggers a projection of the entire gradient, stalling learning across the whole system.
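For contrast with the modular variant introduced below, a minimal sketch of global PCGrad applied to the fully flattened gradients of two tasks (the flattening convention is an assumption of this sketch):

import torch

def global_pcgrad(g_a: torch.Tensor, g_b: torch.Tensor) -> torch.Tensor:
    # g_a, g_b: gradients of two tasks flattened over all model parameters
    overlap = torch.dot(g_a, g_b)
    if overlap < 0:
        # A conflict anywhere triggers a projection of the entire gradient vector
        g_a = g_a - overlap / g_b.norm().pow(2) * g_b
    return g_a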

The Solution

Modular Gradient Surgery

Instead of treating the model as a monolithic block, MGS partitions parameters into functional modules (MLPs, Attention) and applies Gradient Surgery (PCGrad) independently within each module.

  • Local Resolution: Prevents one module's conflict from stalling the whole model.
  • Functional Specialization: Preserves domain-specific knowledge in MLPs.
  • FSDP Compatible: Overhead is negligible (~0.1% training time).
mgs_algorithm.py

from itertools import permutations
import torch

def apply_mgs(gradients, modules):
    # gradients: dict mapping task -> {module_name: flattened gradient tensor}
    # modules:   list of module names (e.g. per-layer MLP and Attention blocks)
    tasks = list(gradients.keys())
    for task_a, task_b in permutations(tasks, 2):
        # Iterate per module: conflicts are detected and resolved locally
        for mod in modules:
            g_a = gradients[task_a][mod]
            g_b = gradients[task_b][mod]
            overlap = torch.dot(g_a, g_b)
            if overlap < 0:
                # Project g_a onto the plane orthogonal to g_b (PCGrad step)
                gradients[task_a][mod] = g_a - overlap / g_b.norm().pow(2) * g_b
    return gradients
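One possible way to assemble the per-module gradient dictionaries consumed by apply_mgs is sketched below; the name-based grouping heuristic is a hypothetical choice, not necessarily the paper's exact partitioning.

from collections import defaultdict
import torch

def collect_module_gradients(model):
    # Group per-parameter gradients by their parent module,
    # e.g. "model.layers.3.mlp" or "model.layers.3.self_attn"
    grouped = defaultdict(list)
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        module_name = name.rsplit(".", 2)[0]  # crude grouping heuristic
        grouped[module_name].append(param.grad.flatten())
    return {mod: torch.cat(parts) for mod, parts in grouped.items()}

Calling this once after each task's backward pass yields one entry of the gradients dict, keyed by task.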

Main Evaluation Results

Comparison of post-training strategies across diverse benchmarks.

Column groups: MATH, GSM, AIME, M. Avg. fall under Math (pass@1, avg@4); WB, AE2, CW3, C. Avg. under Chat; IF and Gen. under Misc.
Backbone Method MATH GSM AIME M. Avg. WB AE2 CW3 C. Avg. IF Gen. Total Avg.
Prior Works
Llama RLMT 4.9 10.2 0.0 5.0 35.6 15.6 36.3 29.2 20.4 21.5 18.9
Qwen RLMT 71.8 82.7 10.0 54.8 44.7 57.2 41.7 47.9 25.9 35.4 42.4
General Reasoner 69.4 81.5 16.7 55.9 13.3 38.8 39.2 30.4 25.1 36.7 38.1
Qwen-Inst OpenThinker v3 78.8 87.7 29.2 65.2 -50.2 19.3 29.7 -0.4 24.4 39.1 32.8
Multi-task Baselines
Qwen Normalized Adv. 76.1 91.2 9.2 58.8 43.8 44.8 39.0 42.5 27.3 34.2 41.9
Model Merging 74.8 90.6 11.7 59.6 33.0 12.4 -24.8 6.9 24.8 32.9 31.4
This Work
Llama Math → Chat 24.6 71.6 0.0 32.1 25.7 29.0 38.3 31.0 21.4 25.0 27.9
Chat → Math 27.6 78.8 1.7 36.0 -35.0 6.5 28.3 -0.1 21.3 28.8 21.5
Naïve Mixing 25.8 76.6 0.0 34.1 15.0 21.5 33.8 23.4 20.7 23.7 25.9
Global Surgery 29.8 80.4 1.7 37.3 2.3 48.3 34.9 28.5 22.9 29.4 30.1
Modular Surgery 26.0 73.6 0.8 33.5 24.7 39.6 34.1 32.8 23.0 29.3 30.2
Qwen Math → Chat 75.9 90.8 13.3 60.0 35.7 35.4 43.9 38.3 24.8 30.8 39.7
Chat → Math 78.2 92.2 10.8 60.4 36.0 46.8 40.1 41.0 24.4 36.5 42.0
Naïve Mixing 76.0 89.8 10.8 58.9 47.1 26.2 40.7 38.0 22.1 36.5 40.4
Global Surgery 73.0 88.8 14.2 58.7 45.4 62.5 40.2 49.4 25.6 36.8 44.2
Modular Surgery 75.1 90.6 14.2 60.0 48.3 59.3 44.2 50.6 28.2 35.3 44.9
Note: Modular Surgery is our method; in the original table, bold marks the highest score and underline the second best.
SOTA Multi-task Performance

MGS consistently achieves the highest Total Average across both Llama and Qwen families, outperforming strong baselines like Global Surgery and Normalized Advantage.

Harmonized Task Balance

While sequential methods trade off one domain for another, MGS maintains high performance in Math while reaching state-of-the-art Chat and IF scores.

Extended Results

MGS scales with more tasks and training steps.

Llama-3.1-8B Results on Mixing Three Tasks

Method Math Chat IF Gen Total Avg
Naïve Mixing 35.0 22.9 25.4 25.2 27.3
Global Surgery 35.8 24.2 31.1 30.3 30.3
Mod. Surgery 36.0 30.5 30.0 33.1 32.6
Compute Trajectory (x → 2x)

MGS prevents capability decay, maintaining a linear growth curve during extended RL.

Citation

@misc{cai2026advancinggeneralpurposereasoningmodels,
      title={Advancing General-Purpose Reasoning Models with Modular Gradient Surgery}, 
      author={Min Cai and Yu Liang and Longzheng Wang and Yan Wang and Yueyang Zhang and Long Xia and Zhiyuan Sun and Xi Ye and Daiting Shi},
      year={2026},
      eprint={2602.02301},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.02301}, 
}