Describe the bug
When training a multimodal/MiMo model with Pipeline Parallelism (heterogeneous Vision Encoder + LLM stages in mcore-v0.17.0) and features like MoE or MTP, the training crashes during the forward step. @mcore-oncall
The error happens in megatron/core/pipeline_parallel/schedules.py inside the forward_step_calc_loss function. The code assumes output_tensor is always a torch.Tensor and directly accesses output_tensor.device to set up the loss scale for MoE auxiliary loss or MTP loss. However, for multimodal pipelines (like Vision Encoders), output_tensor can be a dictionary mapping module names to tensors.
This causes errors like:
File "/Megatron-LM/mllm/train.py", line 590, in run
losses = schedule.forward_backward_pipelining_without_interleaving(...)
File "/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 2209, in forward_backward_pipelining_without_interleaving
output_tensor, num_tokens = forward_step(...)
File "/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 433, in forward_step
output_tensor, num_tokens = forward_step_calc_loss(...)
File "/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 290, in forward_step_calc_loss
else torch.ones(1, device=output_tensor.device)
AttributeError: 'dict' object has no attribute 'device'
Printing output_tensor shows:
output_tensor in forward_step_calc_loss: <class 'dict'>, dict with keys dict_keys(['images'])
output_tensor in forward_step_calc_loss: <class 'torch.Tensor'>, 153.93899536132812
Steps/Code to reproduce bug
Follow the test code in Megatron-LM/tests/unit_tests/models/test_mimo_1f1b_schedule.py and add an expert parallel degree to reproduce it.