AttributeError: 'dict' object has no attribute 'device' in forward_step_calc_loss for Multimodal (MiMo) models +EP #4474

@chenhongyu2048

Describe the bug

When training a multimodal (MiMo) model with pipeline parallelism (heterogeneous Vision Encoder + LLM stages, mcore-v0.17.0) and with MoE or MTP enabled, training crashes during the forward step. @mcore-oncall

The error happens in megatron/core/pipeline_parallel/schedules.py inside the forward_step_calc_loss function. The code assumes output_tensor is always a torch.Tensor and directly accesses output_tensor.device to set up the loss scale for MoE auxiliary loss or MTP loss. However, for multimodal pipelines (like Vision Encoders), output_tensor can be a dictionary mapping module names to tensors.
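One possible workaround, sketched below, is to resolve the device through a small helper that accepts either a tensor or a dictionary of tensors. The helper name `infer_device` and its `default` parameter are hypothetical and not part of Megatron-LM. The sketch is duck-typed (it only looks for a `.device` attribute), so it runs without torch installed. Whether the loss scale should be set at all on a stage that returns a dict is a separate question for the maintainers.

```python
from collections.abc import Mapping


def infer_device(output, default="cpu"):
    """Hypothetical helper (not a Megatron-LM API).

    Return the device of a tensor-like object, or of the first tensor-like
    value found in a possibly nested mapping such as {'images': tensor}.
    Falls back to `default` when no tensor-like value is found.
    """
    # torch.Tensor exposes .device; a plain dict does not.
    device = getattr(output, "device", None)
    if device is not None:
        return device

    # Multimodal stages may return a dict mapping module names to tensors.
    if isinstance(output, Mapping):
        for value in output.values():
            found = infer_device(value, default=None)
            if found is not None:
                return found

    return default
```

Under this assumption, the failing expression could become `torch.ones(1, device=infer_device(output_tensor))`, which keeps the existing behavior for plain tensors and avoids the `AttributeError` for dict outputs.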

This produces errors such as:

  File "/Megatron-LM/mllm/train.py", line 590, in run
    losses = schedule.forward_backward_pipelining_without_interleaving(...)
  File "/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 2209, in forward_backward_pipelining_without_interleaving
    output_tensor, num_tokens = forward_step(...)
  File "/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 433, in forward_step
    output_tensor, num_tokens = forward_step_calc_loss(...)
  File "/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 290, in forward_step_calc_loss
    else torch.ones(1, device=output_tensor.device)
AttributeError: 'dict' object has no attribute 'device'

Printing output_tensor inside forward_step_calc_loss shows both cases, a dict and a plain tensor:

output_tensor in forward_step_calc_loss: <class 'dict'>, dict with keys dict_keys(['images'])
output_tensor in forward_step_calc_loss: <class 'torch.Tensor'>, 153.93899536132812

Steps/Code to reproduce bug

To reproduce, follow the test code in Megatron-LM/tests/unit_tests/models/test_mimo_1f1b_schedule.py and add an expert_parallel degree.
