Qwen3-30B MoE Profiling Issues on vLLM 0.11.0rc1

by Alex Johnson

Introduction

This article examines the profiling anomalies observed with the Qwen3-30B MoE (Mixture of Experts) model running on vLLM 0.11.0rc1. The anomalies show up as unexpected collective communication operations and unusually long execution times for specific fused operators, which together cause a severe drop in overall inference performance. Understanding their root causes matters for optimizing large language models in distributed environments. We describe the specific problems encountered, analyze potential causes, and discuss the implications for practical deployment.

Environment Configuration

Before diving into the specifics of the bug, let's outline the environment in which these issues were observed. This context is crucial for understanding the configuration and potential sources of the problem.

The experiments were conducted on a dual-machine setup with 16 devices per machine; 16 ranks were used in total for this run (8 per node). The distributed setup was configured with the following parameters:

  • World Size: 16
  • Data Parallelism (DP): 16
  • Tensor Parallelism (TP): 1
  • Expert Parallelism (EP): 16
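
For reference, these sizes fit together as follows. The sketch below is a minimal sanity check; the relation ep_size = dp_size * tp_size is an assumption about how vLLM sizes the expert-parallel group when --enable-expert-parallel is set, and the per-node count follows from the two-node launch.

# Minimal sanity check of the parallel layout above.
# Assumption: with --enable-expert-parallel, the expert-parallel group spans
# all DP x TP ranks, so ep_size == dp_size * tp_size.
dp_size, tp_size, node_size = 16, 1, 2

world_size = dp_size * tp_size            # 16 ranks in total
ep_size = dp_size * tp_size               # 16-way expert parallelism
ranks_per_node = world_size // node_size  # 8 devices host ranks on each machine

assert (world_size, ep_size, ranks_per_node) == (16, 16, 8)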

The commands used to run the inference on each node are as follows:

Node 0:

python examples/offline_data_parallel.py \
    --model="/mnt/data2/Qwen3-30B-A3B/" \
    --dp-size=16 \
    --tp-size=1 \
    --node-size=2 \
    --node-rank=0 \
    --enable-expert-parallel \
    --master-addr=192.168.0.111 \
    --master-port=13345 \
    --enforce-eager

Node 1:

python examples/offline_data_parallel.py \
    --model="/mnt/data2/Qwen3-30B-A3B/" \
    --dp-size=16 \
    --tp-size=1 \
    --node-size=2 \
    --node-rank=1 \
    --enable-expert-parallel \
    --master-addr=192.168.0.111 \
    --master-port=13345 \
    --enforce-eager
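
Given --dp-size=16 and --node-size=2, the example script (shown in full later) places 8 data-parallel ranks on each node. The short sketch below simply mirrors the rank-mapping loop from that script.

# DP-rank-to-node mapping implied by --dp-size=16 and --node-size=2
# (mirrors the mapping loop in offline_data_parallel.py shown below).
dp_size, node_size = 16, 2
dp_per_node = dp_size // node_size  # 8 DP ranks (and devices) per node

for node_rank in range(node_size):
    global_ranks = list(
        range(node_rank * dp_per_node, (node_rank + 1) * dp_per_node))
    print(f"node {node_rank}: global DP ranks {global_ranks}")
# node 0: ranks 0-7, node 1: ranks 8-15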

The software environment details, including package versions, are listed below:

Package                           Version       Editable project location
--------------------------------- ------------- --------------------------
aiofiles                          25.1.0
aiohappyeyeballs                  2.6.1
aiohttp                           3.13.2
aiosignal                         1.4.0
annotated-doc                     0.0.4
annotated-types                   0.7.0
anyio                             4.11.0
astor                             0.8.1
async-timeout                     5.0.1
attrs                             25.4.0
auto-tune                         0.1.0
blake3                            1.0.8
blinker                           1.9.0
cachetools                        6.2.2
cbor2                             5.7.1
certifi                           2025.11.12
cffi                              2.0.0
charset-normalizer                3.4.4
click                             8.3.1
cloudpickle                       3.1.2
cmake                             4.2.0
compressed-tensors                0.11.0
dataflow                          0.0.1
decorator                         5.2.1
depyf                             0.19.0
dill                              0.4.0
diskcache                         5.6.3
distro                            1.9.0
dnspython                         2.8.0
einops                            0.8.1
email-validator                   2.3.0
exceptiongroup                    1.3.1
fastapi                           0.122.0
fastapi-cli                       0.0.16
fastapi-cloud-cli                 0.5.1
fastar                            0.7.0
filelock                          3.20.0
Flask                             3.1.2
frozendict                        2.4.7
frozenlist                        1.8.0
fsspec                            2025.10.0
gguf                              0.17.1
h11                               0.16.0
h2                                4.3.0
hccl                              0.1.0
hccl-parser                       0.1
hf-xet                            1.2.0
hpack                             4.1.0
httpcore                          1.0.9
httptools                         0.7.1
httpx                             0.28.1
huggingface-hub                   0.36.0
Hypercorn                         0.18.0
hyperframe                        6.1.0
idna                              3.11
interegular                       0.3.3
itsdangerous                      2.2.0
Jinja2                            3.1.6
jiter                             0.12.0
jsonschema                        4.25.1
jsonschema-specifications         2025.9.1
lark                              1.2.2
llguidance                        0.7.30
llvmlite                          0.45.1
lm-format-enforcer                0.11.3
markdown-it-py                    4.0.0
MarkupSafe                        3.0.3
mdurl                             0.1.2
mistral_common                    1.8.5
modelscope                        1.32.0
mpmath                            1.3.0
msadvisor                         1.0.0
msgpack                           1.1.2
msgspec                           0.20.0
multidict                         6.7.0
networkx                          3.4.2
ninja                             1.13.0
numba                             0.62.1
numpy                             1.26.4
op-compile-tool                   0.1.0
op-gen                            0.1
op-test-frame                     0.1
opc-tool                          0.1.0
openai                            2.8.1
openai-harmony                    0.0.8
opencv-python-headless            4.11.0.86
outlines_core                     0.2.11
packaging                         25.0
partial-json-parser               0.2.1.1.post7
pillow                            12.0.0
pip                               25.3
priority                          2.0.0
prometheus_client                 0.23.1
prometheus-fastapi-instrumentator 7.1.0
propcache                         0.4.1
protobuf                          6.33.1
psutil                            7.1.3
py-cpuinfo                        9.0.0
pybase64                          1.4.2
pybind11                          3.0.1
pycountry                         24.6.1
pycparser                         2.23
pydantic                          2.12.4
pydantic_core                     2.41.5
pydantic-extra-types              2.10.6
Pygments                          2.19.2
python-dotenv                     1.2.1
python-json-logger                4.0.0
python-multipart                  0.0.20
PyYAML                            6.0.3
pyzmq                             27.1.0
Quart                             0.20.0
referencing                       0.37.0
regex                             2025.11.3
requests                          2.32.5
rich                              14.2.0
rich-toolkit                      0.16.0
rignore                           0.7.6
rpds-py                           0.29.0
safetensors                       0.7.0
schedule-search                   0.0.1
scipy                             1.15.3
sentencepiece                     0.2.1
sentry-sdk                        2.46.0
setproctitle                      1.3.7
setuptools                        80.9.0
setuptools-scm                    9.2.2
shellingham                       1.5.4
sniffio                           1.3.1
soundfile                         0.13.1
soxr                              1.0.0
starlette                         0.50.0
sympy                             1.14.0
taskgroup                         0.2.2
te                                0.4.0
tiktoken                          0.12.0
tokenizers                        0.22.1
tomli                             2.3.0
torch                             2.7.1+cpu
torch_npu                         2.7.1
torchvision                       0.22.1
tqdm                              4.67.1
transformers                      4.57.2
typer                             0.20.0
typing_extensions                 4.15.0
typing-inspection                 0.4.2
urllib3                           2.5.0
uvicorn                           0.38.0
uvloop                            0.22.1
vllm                              0.11.0+empty  /mnt/data3/fyj/vllm
vllm_ascend                       0.11.0rc2     /mnt/data3/fyj/vllm-ascend
watchfiles                        1.1.1
websockets                        15.0.1
Werkzeug                          3.1.3
wheel                             0.45.1
wsproto                           1.3.2
xgrammar                          0.1.25
yarl                              1.22.0

With this environment in place, the profiling issues described below were observed.

Bug Description: Profiling Anomalies

The core issue is a set of profiling anomalies observed while running the Qwen3-30B MoE model on vLLM 0.11.0rc1. They are primarily characterized by:

  1. Unexpected Collective Communication Operations: The profiling timeline shows Hcom_allgather and Hcom_allreduce operations. Given the parallel configuration (world_size=16, dp=16, tp=1, ep=16), where tensor parallelism (TP) is set to 1, these AllReduce and AllGather operations are unexpected, and they consume a significant portion of the execution time.

    It is worth highlighting that AllReduce and AllGather are typical of tensor parallelism, where model weights are sharded across devices. With TP disabled (tp=1), these operations should not appear on the critical path, yet the profiling data shows they introduce substantial latency, largely due to long wait times.

  2. Prolonged Execution Time for Fused Operators: The fused operators MoeDistributeDispatchV2 and MoeDistributeCombineV2 exhibit unusually long execution times.

    • MoeDistributeDispatchV2: Wall Duration (wall-clock time) of approximately 17 ms.
    • MoeDistributeCombineV2: Wall Duration of approximately 4 ms.

    For fused operators of this nature, an execution time of 17 ms is considered excessive. In comparison, under vLLM 0.9.1, these operators typically execute in around 500 microseconds. The significant increase in execution time points to potential inefficiencies in the operator implementation or the interaction with the underlying hardware.

  3. Severe Degradation in Overall Inference Performance: The combined effect of the unexpected communication overhead and the prolonged operator execution times is a drastic reduction in inference performance. A single Decode operation takes approximately 62 ms, and a complete Decode pass can take around 2 seconds, which works out to roughly 0.5 tokens per second; this is far below the expected performance level and practically unusable for real-world applications (a rough consistency check of these figures follows after this list).

    This level of performance degradation is a critical issue, making the model unsuitable for deployment in scenarios requiring low latency and high throughput.

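As a rough consistency check of these figures, the short sketch below combines the per-operator durations with the reported decode times. All durations come from the profile above; the MoE layer count is an assumption made only for illustration.

# Back-of-the-envelope check of the reported numbers.
# Dispatch/combine durations and the 0.9.1 baseline come from the profile above;
# the MoE layer count is an assumed figure for a Qwen3-30B-class model.
dispatch_ms = 17.0      # MoeDistributeDispatchV2 wall duration
combine_ms = 4.0        # MoeDistributeCombineV2 wall duration
baseline_ms = 0.5       # ~500 us per operator under vLLM 0.9.1
num_moe_layers = 48     # assumption, for illustration only

slowdown = dispatch_ms / baseline_ms                           # ~34x regression
moe_ms_per_step = num_moe_layers * (dispatch_ms + combine_ms)  # ~1000 ms
tokens_per_s = 1.0 / 2.0                                       # ~2 s per decode step

print(f"dispatch slowdown: ~{slowdown:.0f}x")
print(f"MoE dispatch+combine alone: ~{moe_ms_per_step:.0f} ms per decode step")
print(f"throughput at ~2 s per decode step: ~{tokens_per_s:.1f} tokens/s")

On these assumptions, the dispatch and combine operators alone account for roughly a second per decode step, which is consistent with the observed ~2 s per step and ~0.5 tokens per second.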

Code Snippet

To provide further context, the script used for running the inference is shown below. It is based on the offline_data_parallel.py example from the vLLM-Ascend repository; it sets up data parallelism and drives the inference process across multiple devices.

import contextlib
import gc
import os
from time import sleep

import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (  # noqa E402
    destroy_distributed_environment, destroy_model_parallel)
from vllm.utils import get_open_port

os.environ["VLLM_USE_MODELSCOPE"] = "True"
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

def parse_args():
    import argparse

    parser = argparse.ArgumentParser(description="Data Parallel Inference")
    parser.add_argument(
        "--model",
        type=str,
        default="ibm-research/PowerMoE-3b",
        help="Model name or path",
    )
    parser.add_argument("--dp-size",
                        type=int,
                        default=2,
                        help="Data parallel size")
    parser.add_argument("--tp-size",
                        type=int,
                        default=1,
                        help="Tensor parallel size")
    parser.add_argument("--node-size",
                        type=int,
                        default=1,
                        help="Total number of nodes")
    parser.add_argument("--node-rank",
                        type=int,
                        default=0,
                        help="Rank of the current node")
    parser.add_argument("--master-addr",
                        type=str,
                        default="",
                        help="Master node IP address")
    parser.add_argument("--master-port",
                        type=int,
                        default=0,
                        help="Master node port")
    parser.add_argument("--enforce-eager",
                        action="store_true",
                        help="Enforce eager mode execution.")
    parser.add_argument("--trust-remote-code",
                        action="store_true",
                        help="Trust remote code.")
    parser.add_argument("--enable-expert-parallel",
                        action="store_true",
                        help="Enable expert parallel, used in MOE models.")
    return parser.parse_args()

def cleanup_env_and_memory():
    destroy_model_parallel()
    destroy_distributed_environment()
    with contextlib.suppress(AssertionError):
        torch.distributed.destroy_process_group()
    gc.collect()
    torch.npu.empty_cache()
    torch.npu.reset_peak_memory_stats()

def main(
    model,
    dp_size,
    local_dp_rank,
    global_dp_rank,
    dp_master_ip,
    dp_master_port,
    GPUs_per_dp_rank,
    enable_expert_parallel,
    enforce_eager,
    trust_remote_code,
):
    # DP is only supported on the V1 engine
    os.environ["VLLM_DP_RANK"] = str(global_dp_rank)
    os.environ["VLLM_DP_RANK_LOCAL"] = str(local_dp_rank)
    os.environ["VLLM_DP_SIZE"] = str(dp_size)
    os.environ["VLLM_DP_MASTER_IP"] = dp_master_ip
    os.environ["VLLM_DP_MASTER_PORT"] = str(dp_master_port)
    os.environ['VLLM_TORCH_PROFILER_DIR'] = '/mnt/data3/fyj/vllm-ascend/profiler/dp4_tp4_ep16/max_token_16/npu_moe_distribute_dispatch_v2'

    prompts = ["The operation $\\times$ is defined for all nonzero numbers by $a \\otimes b = \\frac{a^{2}}{b}$. Determine $[(1 \\otimes 2) \\otimes 3] - [1 \\otimes (2 \\otimes 3)]{{content}}quot;] * 128
    if len(prompts) == 0:
        # if any rank has no prompts to process,
        # we need to set a placeholder prompt
        prompts = ["Placeholder"]
    print(f"DP rank {global_dp_rank} needs to process {len(prompts)} prompts")

    # Create a sampling params object.
    # since we are doing data parallel, every rank can have different
    # sampling params. here we set different max_tokens for different
    # ranks for demonstration.
    sampling_params = SamplingParams(temperature=0.8,
                                     top_p=0.95,
                                     max_tokens=16)

    # Create an LLM.
    llm = LLM(
        model=model,
        tensor_parallel_size=GPUs_per_dp_rank,
        enforce_eager=enforce_eager,
        enable_expert_parallel=enable_expert_parallel,
        trust_remote_code=trust_remote_code,
        tokenizer="/mnt/data2/Qwen3-30B-A3B/",
        tokenizer_mode="slow"
    )

    llm.start_profile()

    outputs = llm.generate(prompts, sampling_params)

    llm.stop_profile()
    # Print the outputs.
    for i, output in enumerate(outputs):
        if i >= 5:
            # print only 5 outputs
            break
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"DP rank {global_dp_rank}, Prompt: {prompt!r}, "
              f"Generated text: {generated_text!r}")

    # Give engines time to pause their processing loops before exiting.
    sleep(5)
    del llm
    cleanup_env_and_memory()

if __name__ == "__main__":
    args = parse_args()

    dp_size = args.dp_size
    tp_size = args.tp_size
    node_size = args.node_size
    node_rank = args.node_rank

    if node_size == 1:
        dp_master_ip = "127.0.0.1"
        dp_master_port = get_open_port()
    else:
        dp_master_ip = args.master_addr
        dp_master_port = args.master_port

    assert dp_size % node_size == 0, "dp_size should be divisible by node_size"
    dp_per_node = dp_size // node_size

    from multiprocessing import Process

    procs = []
    for local_dp_rank, global_dp_rank in enumerate(
            range(node_rank * dp_per_node, (node_rank + 1) * dp_per_node)):
        proc = Process(
            target=main,
            args=(
                args.model,
                dp_size,
                local_dp_rank,
                global_dp_rank,
                dp_master_ip,
                dp_master_port,
                tp_size,
                args.enable_expert_parallel,
                args.enforce_eager,
                args.trust_remote_code,
            ),
        )
        proc.start()
        procs.append(proc)
    exit_code = 0
    for proc in procs:
        proc.join(timeout=900)
        if proc.exitcode is None:
            print(
                f"Killing process {proc.pid} that didn't stop within 15 minutes."
            )
            proc.kill()
            exit_code = 1
        elif proc.exitcode:
            exit_code = proc.exitcode

    exit(exit_code)

This script uses data parallelism to distribute the inference workload across multiple devices. It launches one vLLM engine per DP rank, loads the Qwen3-30B MoE model, and generates text for a set of prompts. The profiling data discussed above was collected during the execution of this script.

Root Cause Analysis: Potential Issues

Given the observed phenomena, several potential root causes can be hypothesized:

  1. Profiling Tool Overhead: One primary suspect is the profiling tool itself. It is possible that the profiling mechanism introduced in vLLM 0.11.0rc1 has a significant overhead, particularly when dealing with MoE models and distributed setups. The act of collecting profiling data might be interfering with the model's execution, leading to inflated execution times and the appearance of spurious communication operations.

    If the profiling tool is excessively intrusive, it could distort the performance characteristics of the model, making it difficult to identify genuine bottlenecks.

  2. Inefficient Fused Operator Implementation: The extended execution times for MoeDistributeDispatchV2 and MoeDistributeCombineV2 suggest potential inefficiencies in their implementation. These operators are crucial for managing the distribution of work among experts in the MoE model. If these operators are not optimized for the specific hardware and software environment, they can become a bottleneck.

    A closer examination of the operator code and its interaction with the underlying hardware is necessary to pinpoint potential areas for optimization.

  3. Unnecessary Collective Communication: The presence of AllReduce and AllGather operations despite TP being disabled indicates a potential issue in the communication strategy. Some part of the vLLM engine or the MoE model implementation may be inadvertently triggering these operations (a small inspection sketch follows after this list).

    This could be due to a configuration error, a bug in the communication logic, or an interaction between different components of the system.

  4. Hardware-Specific Issues: The performance anomalies might also be related to the specific hardware being used. The interaction between vLLM and the underlying hardware, such as the GPUs and the network interconnect, can significantly impact performance. Driver issues, suboptimal hardware configurations, or limitations in the hardware itself could be contributing factors.

    Profiling on different hardware setups and comparing the results can help identify hardware-specific issues.
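
As a concrete starting point for hypothesis 3, the sizes of the parallel groups can be logged from inside a worker process to confirm that tensor parallelism really is 1. The sketch below is a minimal example, assuming the vllm.distributed.parallel_state helper shown is available in this build; it would need to run inside a worker (for example as a temporary debug print in the model-runner code), not in the launcher script.

# Minimal sketch: log parallel-group sizes from inside a worker process.
# Assumes vllm.distributed.parallel_state.get_tensor_model_parallel_world_size
# is available in this vLLM build and that the groups are already initialized.
import torch.distributed as dist
from vllm.distributed.parallel_state import get_tensor_model_parallel_world_size


def log_parallel_groups() -> None:
    world = dist.get_world_size() if dist.is_initialized() else 1
    tp = get_tensor_model_parallel_world_size()
    print(f"world_size={world}, tp_size={tp}")
    # With tp == 1, any Hcom_allreduce/Hcom_allgather on the timeline must come
    # from another group, e.g. DP coordination or the expert-parallel dispatch path.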

Hypothesis Validation

To validate these hypotheses, several steps can be taken:

  1. Disable Profiling: Run the inference without profiling and check whether performance improves. If it is significantly better without profiling, the profiling tool is a major contributor to the observed anomalies (a minimal timing sketch follows after this list).
  2. Operator Performance Analysis: Use more fine-grained profiling tools to analyze the execution of MoeDistributeDispatchV2 and MoeDistributeCombineV2. This can help identify specific parts of the operators that are causing the slowdown.
  3. Communication Analysis: Investigate the communication patterns more closely to determine why Allreduce and Allgather operations are being triggered. Tools for monitoring network traffic and communication within the distributed system can be helpful.
  4. Hardware Testing: Run the inference on different hardware configurations to see if the performance issues are specific to the current setup.
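
For validation step 1, the quickest check is to time generation with profiling fully disabled (VLLM_TORCH_PROFILER_DIR unset, no start_profile()/stop_profile() calls) and compare tokens per second against the profiled run. The sketch below is a minimal single-process illustration using the same model path and sampling settings as the script above; for the real comparison, the same change would be applied inside the data-parallel script itself.

# Minimal A/B timing sketch for hypothesis 1 (profiler overhead).
# Run once with profiling disabled and once with it enabled, then compare.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="/mnt/data2/Qwen3-30B-A3B/",
          tensor_parallel_size=1,
          enable_expert_parallel=True,
          enforce_eager=True)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=16)
prompts = ["Placeholder"] * 128

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f} s -> {generated / elapsed:.2f} tokens/s")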

Comparison with vLLM 0.9.1

The performance data from vLLM 0.9.1 provides a valuable baseline for comparison. The fact that the fused operators executed much faster (around 500 microseconds) in the older version suggests that there might be a regression in the newer version. Understanding the changes between vLLM 0.9.1 and vLLM 0.11.0rc1, especially those related to MoE model support and distributed execution, could shed light on the root cause.

Conclusion

The profiling anomalies observed with the Qwen3-30B MoE model on vLLM 0.11.0rc1 represent a significant obstacle to deploying large language models in distributed environments. The unexpected collective communication operations, prolonged execution times for fused operators, and overall degradation in inference performance point to potential issues in the profiling tool, the operator implementations, the communication strategy, or the hardware interaction. Further investigation and validation are needed to pinpoint the exact root causes and implement effective fixes; addressing them is essential for realizing the full potential of MoE models and ensuring efficient, scalable inference.

For more information on profiling and debugging distributed deep learning applications, you can refer to resources like the PyTorch Profiler documentation.