Describe the bug
When training a multimodal/MiMo model with Pipeline Parallelism (heterogeneous Vision Encoder + LLM stages in mcore-v0.17.0) and features like MoE or MTP, the training crashes during the forward step. @mcore-oncall
The error happens in megatron/core/pipeline_parallel/schedules.py inside the forward_step_calc_loss function. The code assumes output_tensor is always a torch.Tensor and directly accesses output_tensor.device to set up the loss scale for MoE auxiliary loss or MTP loss. However, for multimodal pipelines (like Vision Encoders), output_tensor can be a dictionary mapping module names to tensors.
This causes errors like:
File "/Megatron-LM/mllm/train.py", line 590, in run
losses = schedule.forward_backward_pipelining_without_interleaving(...)
File "/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 2209, in forward_backward_pipelining_without_interleaving
output_tensor, num_tokens = forward_step(...)
File "/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 433, in forward_step
output_tensor, num_tokens = forward_step_calc_loss(...)
File "/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 290, in forward_step_calc_loss
else torch.ones(1, device=output_tensor.device)
AttributeError: 'dict' object has no attribute 'device'
Printing output_tensor shows:
output_tensor in forward_step_calc_loss: <class 'dict'>, dict with keys dict_keys(['images'])
output_tensor in forward_step_calc_loss: <class 'torch.Tensor'>, 153.93899536132812
Steps/Code to reproduce bug
Follow the test code in Megatron-LM/tests/unit_tests/models/test_mimo_1f1b_schedule.py and add an expert parallel degree to reproduce it.