Qwen3-30B MoE Profiling Issues on vLLM 0.11.0rc1
Introduction
In this article, we examine the profiling anomalies observed with the Qwen3-30B MoE (Mixture of Experts) model running on vLLM version 0.11.0rc1. These anomalies manifest as unexpected collective communication operations and prolonged execution times for specific fused operators, leading to a significant drop in overall inference performance. Understanding the root causes of these issues is crucial for optimizing large language models in distributed environments. We will outline the specific problems encountered, analyze potential causes, and discuss the implications for practical deployment.
Environment Configuration
Before diving into the specifics of the bug, let's outline the environment in which these issues were observed. This context is crucial for understanding the configuration and potential sources of the problem.
The experiments were conducted on a dual-machine setup, each machine equipped with 16 GPUs. The distributed setup was configured with the following parameters:
- World Size: 16
- Data Parallelism (DP): 16
- Tensor Parallelism (TP): 1
- Expert Parallelism (EP): 16
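Given dp_size=16 split across node_size=2 with tp_size=1, each node hosts 8 data-parallel ranks and each rank drives a single device. As a quick illustration, the following minimal Python sketch reproduces that rank-to-node mapping using the same arithmetic as the launch script shown later in this article:
# Sketch of the DP rank layout implied by dp=16, tp=1 across 2 nodes.
# Mirrors the arithmetic in the launch script's __main__ block further below.
dp_size, tp_size, node_size = 16, 1, 2
dp_per_node = dp_size // node_size  # 8 data-parallel ranks per node
for node_rank in range(node_size):
    global_ranks = list(
        range(node_rank * dp_per_node, (node_rank + 1) * dp_per_node))
    print(f"node {node_rank}: DP ranks {global_ranks}, "
          f"{dp_per_node * tp_size} device(s) in use")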
The commands used to run the inference on each node are as follows:
Node 0:
python examples/offline_data_parallel.py \
--model="/mnt/data2/Qwen3-30B-A3B/" \
--dp-size=16 \
--tp-size=1 \
--node-size=2 \
--node-rank=0 \
--enable-expert-parallel \
--master-addr=192.168.0.111 \
--master-port=13345 \
--enforce-eager
Node 1:
python examples/offline_data_parallel.py \
--model="/mnt/data2/Qwen3-30B-A3B/" \
--dp-size=16 \
--tp-size=1 \
--node-size=2 \
--node-rank=1 \
--enable-expert-parallel \
--master-addr=192.168.0.111 \
--master-port=13345 \
--enforce-eager
The software environment details, including package versions, are listed below:
Package Version Editable project location
--------------------------------- ------------- --------------------------
aiofiles 25.1.0
aiohappyeyeballs 2.6.1
aiohttp 3.13.2
aiosignal 1.4.0
annotated-doc 0.0.4
annotated-types 0.7.0
anyio 4.11.0
astor 0.8.1
async-timeout 5.0.1
attrs 25.4.0
auto-tune 0.1.0
blake3 1.0.8
blinker 1.9.0
cachetools 6.2.2
cbor2 5.7.1
certifi 2025.11.12
cffi 2.0.0
charset-normalizer 3.4.4
click 8.3.1
cloudpickle 3.1.2
cmake 4.2.0
compressed-tensors 0.11.0
dataflow 0.0.1
decorator 5.2.1
depyf 0.19.0
dill 0.4.0
diskcache 5.6.3
distro 1.9.0
dnspython 2.8.0
einops 0.8.1
email-validator 2.3.0
exceptiongroup 1.3.1
fastapi 0.122.0
fastapi-cli 0.0.16
fastapi-cloud-cli 0.5.1
fastar 0.7.0
filelock 3.20.0
Flask 3.1.2
frozendict 2.4.7
frozenlist 1.8.0
fsspec 2025.10.0
gguf 0.17.1
h11 0.16.0
h2 4.3.0
hccl 0.1.0
hccl-parser 0.1
hf-xet 1.2.0
hpack 4.1.0
httpcore 1.0.9
httptools 0.7.1
httpx 0.28.1
huggingface-hub 0.36.0
Hypercorn 0.18.0
hyperframe 6.1.0
idna 3.11
interegular 0.3.3
itsdangerous 2.2.0
Jinja2 3.1.6
jiter 0.12.0
jsonschema 4.25.1
jsonschema-specifications 2025.9.1
lark 1.2.2
llguidance 0.7.30
llvmlite 0.45.1
lm-format-enforcer 0.11.3
markdown-it-py 4.0.0
MarkupSafe 3.0.3
mdurl 0.1.2
mistral_common 1.8.5
modelscope 1.32.0
mpmath 1.3.0
msadvisor 1.0.0
msgpack 1.1.2
msgspec 0.20.0
multidict 6.7.0
networkx 3.4.2
ninja 1.13.0
numba 0.62.1
numpy 1.26.4
op-compile-tool 0.1.0
op-gen 0.1
op-test-frame 0.1
opc-tool 0.1.0
openai 2.8.1
openai-harmony 0.0.8
opencv-python-headless 4.11.0.86
outlines_core 0.2.11
packaging 25.0
partial-json-parser 0.2.1.1.post7
pillow 12.0.0
pip 25.3
priority 2.0.0
prometheus_client 0.23.1
prometheus-fastapi-instrumentator 7.1.0
propcache 0.4.1
protobuf 6.33.1
psutil 7.1.3
py-cpuinfo 9.0.0
pybase64 1.4.2
pybind11 3.0.1
pycountry 24.6.1
pycparser 2.23
pydantic 2.12.4
pydantic_core 2.41.5
pydantic-extra-types 2.10.6
Pygments 2.19.2
python-dotenv 1.2.1
python-json-logger 4.0.0
python-multipart 0.0.20
PyYAML 6.0.3
pyzmq 27.1.0
Quart 0.20.0
referencing 0.37.0
regex 2025.11.3
requests 2.32.5
rich 14.2.0
rich-toolkit 0.16.0
rignore 0.7.6
rpds-py 0.29.0
safetensors 0.7.0
schedule-search 0.0.1
scipy 1.15.3
sentencepiece 0.2.1
sentry-sdk 2.46.0
setproctitle 1.3.7
setuptools 80.9.0
setuptools-scm 9.2.2
shellingham 1.5.4
sniffio 1.3.1
soundfile 0.13.1
soxr 1.0.0
starlette 0.50.0
sympy 1.14.0
taskgroup 0.2.2
te 0.4.0
tiktoken 0.12.0
tokenizers 0.22.1
tomli 2.3.0
torch 2.7.1+cpu
torch_npu 2.7.1
torchvision 0.22.1
tqdm 4.67.1
transformers 4.57.2
typer 0.20.0
typing_extensions 4.15.0
typing-inspection 0.4.2
urllib3 2.5.0
uvicorn 0.38.0
uvloop 0.22.1
vllm 0.11.0+empty /mnt/data3/fyj/vllm
vllm_ascend 0.11.0rc2 /mnt/data3/fyj/vllm-ascend
watchfiles 1.1.1
websockets 15.0.1
Werkzeug 3.1.3
wheel 0.45.1
wsproto 1.3.2
xgrammar 0.1.25
yarl 1.22.0
With this environment configured, the profiling issues were observed, which we will now discuss in detail.
Bug Description: Profiling Anomalies
The core issue is a set of profiling anomalies observed while running the Qwen3-30B MoE model on vLLM 0.11.0rc1. These anomalies are primarily characterized by:
- Unexpected Collective Communication Operations: The profiling timeline reveals Hcom_allgather and Hcom_allreduce operations. Given the parallel configuration (world_size=16, dp=16, tp=1, ep=16), where tensor parallelism (TP) is set to 1, the appearance of Allreduce operations is unexpected, and these collectives consume a significant portion of the execution time. Allreduce and Allgather are typical of tensor parallelism, where model weights are distributed across multiple devices; with TP disabled, they should not be a bottleneck. The profiling data clearly shows that these operations introduce substantial latency due to extensive wait times.
- Prolonged Execution Time for Fused Operators: The fused operators MoeDistributeDispatchV2 and MoeDistributeCombineV2 exhibit unusually long execution times (see the trace-aggregation sketch at the end of this section):
  - MoeDistributeDispatchV2: Wall Duration (wall-clock time) of approximately 17 ms.
  - MoeDistributeCombineV2: Wall Duration of approximately 4 ms.
  For fused operators of this nature, an execution time of 17 ms is excessive; under vLLM 0.9.1, these operators typically execute in around 500 microseconds. The sharp increase points to inefficiencies in the operator implementation or in its interaction with the underlying hardware.
- Severe Degradation in Overall Inference Performance: The combined effect of the unexpected communication overhead and the prolonged operator execution times is a drastic reduction in inference performance. A single Decode operation takes approximately 62 ms, and a complete Decode process can take around 2 seconds, translating to an inference speed of roughly 0.5 tokens per second, which is far below the expected level and practically unusable for real-world applications.
This level of performance degradation is a critical issue, making the model unsuitable for deployment in scenarios requiring low latency and high throughput.
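To quantify these observations reproducibly rather than reading single events off the timeline, the exported traces can be aggregated per operator name. The sketch below is a minimal example, assuming the files written under VLLM_TORCH_PROFILER_DIR are in Chrome trace format (as torch.profiler exports them) with per-event name and dur fields in microseconds; the trace path is a placeholder to be adjusted.
# Minimal sketch: aggregate wall durations per operator from a Chrome-format
# trace exported by the vLLM torch profiler. TRACE_PATH is a placeholder;
# point it at an actual *.json trace found under VLLM_TORCH_PROFILER_DIR.
import json
from collections import defaultdict

TRACE_PATH = "/path/to/profiler/trace.json"  # placeholder

with open(TRACE_PATH) as f:
    events = json.load(f).get("traceEvents", [])

totals = defaultdict(lambda: [0, 0.0])  # name -> [call count, total duration in us]
for ev in events:
    name = ev.get("name", "")
    if "Hcom_" in name or "MoeDistribute" in name:
        totals[name][0] += 1
        totals[name][1] += float(ev.get("dur", 0.0))  # "dur" is in microseconds

for name, (count, total_us) in sorted(totals.items(), key=lambda kv: -kv[1][1]):
    print(f"{name}: {count} calls, {total_us / 1000:.2f} ms total, "
          f"{total_us / count / 1000:.3f} ms average")
Average wall durations on the order of 17 ms for MoeDistributeDispatchV2, as reported above, show up clearly in such an aggregation.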
Code Snippet
To provide further context, the code snippet used for running the inference is provided below. This code is based on the offline_data_parallel.py example from the vLLM-Ascend repository. This script sets up data parallelism and manages the inference process across multiple devices.
import contextlib
import gc
import os
from time import sleep

import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import ( # noqa E402
    destroy_distributed_environment, destroy_model_parallel)
from vllm.utils import get_open_port

os.environ["VLLM_USE_MODELSCOPE"] = "True"
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"


def parse_args():
    import argparse
    parser = argparse.ArgumentParser(description="Data Parallel Inference")
    parser.add_argument(
        "--model",
        type=str,
        default="ibm-research/PowerMoE-3b",
        help="Model name or path",
    )
    parser.add_argument("--dp-size",
                        type=int,
                        default=2,
                        help="Data parallel size")
    parser.add_argument("--tp-size",
                        type=int,
                        default=1,
                        help="Tensor parallel size")
    parser.add_argument("--node-size",
                        type=int,
                        default=1,
                        help="Total number of nodes")
    parser.add_argument("--node-rank",
                        type=int,
                        default=0,
                        help="Rank of the current node")
    parser.add_argument("--master-addr",
                        type=str,
                        default="",
                        help="Master node IP address")
    parser.add_argument("--master-port",
                        type=int,
                        default=0,
                        help="Master node port")
    parser.add_argument("--enforce-eager",
                        action="store_true",
                        help="Enforce eager mode execution.")
    parser.add_argument("--trust-remote-code",
                        action="store_true",
                        help="Trust remote code.")
    parser.add_argument("--enable-expert-parallel",
                        action="store_true",
                        help="Enable expert parallel, used in MOE models.")
    return parser.parse_args()


def cleanup_env_and_memory():
    destroy_model_parallel()
    destroy_distributed_environment()
    with contextlib.suppress(AssertionError):
        torch.distributed.destroy_process_group()
    gc.collect()
    torch.npu.empty_cache()
    torch.npu.reset_peak_memory_stats()


def main(
    model,
    dp_size,
    local_dp_rank,
    global_dp_rank,
    dp_master_ip,
    dp_master_port,
    GPUs_per_dp_rank,
    enable_expert_parallel,
    enforce_eager,
    trust_remote_code,
):
    # DP only support on V1 engine
    os.environ["VLLM_DP_RANK"] = str(global_dp_rank)
    os.environ["VLLM_DP_RANK_LOCAL"] = str(local_dp_rank)
    os.environ["VLLM_DP_SIZE"] = str(dp_size)
    os.environ["VLLM_DP_MASTER_IP"] = dp_master_ip
    os.environ["VLLM_DP_MASTER_PORT"] = str(dp_master_port)
    os.environ['VLLM_TORCH_PROFILER_DIR'] = '/mnt/data3/fyj/vllm-ascend/profiler/dp4_tp4_ep16/max_token_16/npu_moe_distribute_dispatch_v2'
    prompts = ["The operation $\\times$ is defined for all nonzero numbers by $a \\otimes b = \\frac{a^{2}}{b}$. Determine $[(1 \\otimes 2) \\otimes 3] - [1 \\otimes (2 \\otimes 3)]$"] * 128
    if len(prompts) == 0:
        # if any rank has no prompts to process,
        # we need to set a placeholder prompt
        prompts = ["Placeholder"]
    print(f"DP rank {global_dp_rank} needs to process {len(prompts)} prompts")
    # Create a sampling params object.
    # since we are doing data parallel, every rank can have different
    # sampling params. here we set different max_tokens for different
    # ranks for demonstration.
    sampling_params = SamplingParams(temperature=0.8,
                                     top_p=0.95,
                                     max_tokens=16)
    # Create an LLM.
    llm = LLM(
        model=model,
        tensor_parallel_size=GPUs_per_dp_rank,
        enforce_eager=enforce_eager,
        enable_expert_parallel=enable_expert_parallel,
        trust_remote_code=trust_remote_code,
        tokenizer="/mnt/data2/Qwen3-30B-A3B/",
        tokenizer_mode="slow"
    )
    llm.start_profile()
    outputs = llm.generate(prompts, sampling_params)
    llm.stop_profile()
    # Print the outputs.
    for i, output in enumerate(outputs):
        if i >= 5:
            # print only 5 outputs
            break
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"DP rank {global_dp_rank}, Prompt: {prompt!r}, "
              f"Generated text: {generated_text!r}")
    # Give engines time to pause their processing loops before exiting.
    sleep(5)
    del llm
    cleanup_env_and_memory()


if __name__ == "__main__":
    args = parse_args()
    dp_size = args.dp_size
    tp_size = args.tp_size
    node_size = args.node_size
    node_rank = args.node_rank
    if node_size == 1:
        dp_master_ip = "127.0.0.1"
        dp_master_port = get_open_port()
    else:
        dp_master_ip = args.master_addr
        dp_master_port = args.master_port
    assert dp_size % node_size == 0, "dp_size should be divisible by node_size"
    dp_per_node = dp_size // node_size
    from multiprocessing import Process
    procs = []
    for local_dp_rank, global_dp_rank in enumerate(
            range(node_rank * dp_per_node, (node_rank + 1) * dp_per_node)):
        proc = Process(
            target=main,
            args=(
                args.model,
                dp_size,
                local_dp_rank,
                global_dp_rank,
                dp_master_ip,
                dp_master_port,
                tp_size,
                args.enable_expert_parallel,
                args.enforce_eager,
                args.trust_remote_code,
            ),
        )
        proc.start()
        procs.append(proc)
    exit_code = 0
    for proc in procs:
        proc.join(timeout=900)
        if proc.exitcode is None:
            print(
                f"Killing process {proc.pid} that didn't stop within 15 minutes."
            )
            proc.kill()
            exit_code = 1
        elif proc.exitcode:
            exit_code = proc.exitcode
    exit(exit_code)
This code uses data parallelism to distribute the inference workload across multiple devices. It initializes the vLLM engine, loads the Qwen3-30B MoE model, and generates text for a set of prompts. The profiling data was collected during the execution of this script.
Root Cause Analysis: Potential Issues
Given the observed phenomena, several potential root causes can be hypothesized:
- Profiling Tool Overhead: One primary suspect is the profiling tool itself. The profiling mechanism used with vLLM 0.11.0rc1 may carry significant overhead, particularly for MoE models in distributed setups: the act of collecting profiling data might interfere with the model's execution, inflating execution times and producing the appearance of spurious communication operations. If the profiling tool is excessively intrusive, it distorts the performance characteristics of the model and makes it difficult to identify genuine bottlenecks.
- Inefficient Fused Operator Implementation: The extended execution times for MoeDistributeDispatchV2 and MoeDistributeCombineV2 suggest potential inefficiencies in their implementation. These operators manage the distribution of work among the experts in the MoE model; if they are not optimized for the specific hardware and software environment, they can become a bottleneck. A closer examination of the operator code and its interaction with the underlying hardware is needed to pinpoint areas for optimization.
- Unnecessary Collective Communication: The presence of Allreduce and Allgather operations despite TP being disabled indicates a potential issue in the communication strategy. Some part of the vLLM engine or the MoE model implementation may be inadvertently triggering these operations, whether through a configuration error, a bug in the communication logic, or an interaction between different components of the system (a parallel-state check is sketched after this list).
- Hardware-Specific Issues: The performance anomalies might also be related to the specific hardware in use. The interaction between vLLM and the underlying devices and network interconnect can significantly affect performance; driver issues, suboptimal hardware configuration, or limitations of the hardware itself could all contribute. Profiling on different hardware setups and comparing the results can help identify hardware-specific issues.
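To probe the third hypothesis, it helps to confirm what the workers themselves report as the parallel configuration. The following is a minimal sketch, assuming vllm.distributed exposes get_tensor_model_parallel_rank and get_tensor_model_parallel_world_size (as mainline vLLM does) and that the function is invoked inside an already-initialized worker process, for example by temporarily inserting a call into the model's forward path while debugging:
# Sketch: log the parallel sizes as seen from inside a vLLM worker process.
# Assumes the vllm.distributed helpers below exist in this build and that the
# function runs after distributed initialization (e.g. temporarily called from
# the model's forward path).
import os

from vllm.distributed import (get_tensor_model_parallel_rank,
                              get_tensor_model_parallel_world_size)


def log_parallel_state() -> None:
    tp_rank = get_tensor_model_parallel_rank()
    tp_size = get_tensor_model_parallel_world_size()
    dp_rank = os.environ.get("VLLM_DP_RANK", "?")
    dp_size = os.environ.get("VLLM_DP_SIZE", "?")
    # With dp=16, tp=1, ep=16 we expect tp_size == 1; if a larger TP group is
    # reported, that would explain the unexpected Hcom_allreduce calls.
    print(f"[dp {dp_rank}/{dp_size}] tp_rank={tp_rank}, tp_size={tp_size}")
If tp_size is reported as 1 and the Allreduce calls persist, the collectives most likely originate from the expert-parallel or data-parallel communication path rather than from a hidden TP group.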
Hypothesis Validation
To validate these hypotheses, several steps can be taken:
- Disable Profiling: Run the inference without profiling to see whether performance improves. If performance is significantly better without profiling, the profiling tool is indeed a major contributor to the observed anomalies (a reduced A/B timing sketch follows this list).
- Operator Performance Analysis: Use finer-grained profiling tools to analyze the execution of MoeDistributeDispatchV2 and MoeDistributeCombineV2 and identify the specific parts of the operators causing the slowdown.
- Communication Analysis: Investigate the communication patterns more closely to determine why Allreduce and Allgather operations are being triggered. Tools for monitoring network traffic and communication within the distributed system can be helpful.
- Hardware Testing: Run the inference on different hardware configurations to see if the performance issues are specific to the current setup.
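For the first validation step, rather than writing a new benchmark, the existing script can be instrumented with a small timing wrapper and run twice, once with profiling enabled and once without. The sketch below is a drop-in helper under the assumption that it replaces the start_profile/generate/stop_profile block in main() above; the ENABLE_PROFILE environment variable is a hypothetical toggle introduced here purely for the comparison, and VLLM_TORCH_PROFILER_DIR must still be set for the profiled case, as in the original script:
# Drop-in A/B timing helper for the script above: wraps llm.generate() and
# toggles profiling via an environment variable so the same workload can be
# timed with and without the profiler.
import os
import time


def timed_generate(llm, prompts, sampling_params, global_dp_rank):
    profile = os.environ.get("ENABLE_PROFILE", "0") == "1"  # hypothetical toggle
    if profile:
        llm.start_profile()
    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling_params)
    elapsed = time.perf_counter() - start
    if profile:
        llm.stop_profile()
    tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"DP rank {global_dp_rank}: profile={profile}, "
          f"{elapsed:.2f} s, {tokens / elapsed:.2f} tokens/s")
    return outputs
In main(), the llm.start_profile() / llm.generate() / llm.stop_profile() block would be replaced by outputs = timed_generate(llm, prompts, sampling_params, global_dp_rank); comparing runs with ENABLE_PROFILE=1 and ENABLE_PROFILE=0 then gives a direct measure of profiler overhead.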
Comparison with vLLM 0.9.1
The performance data from vLLM 0.9.1 provides a valuable baseline for comparison. The fact that the fused operators executed much faster (around 500 microseconds) in the older version suggests that there might be a regression in the newer version. Understanding the changes between vLLM 0.9.1 and vLLM 0.11.0rc1, especially those related to MoE model support and distributed execution, could shed light on the root cause.
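Because both vllm and vllm-ascend are installed as editable checkouts (visible in the package list above), one practical way to hunt for such a regression is to list the commits between the two releases that touch the MoE and communication code paths. The sketch below shells out to git from Python; the tag names and directory paths are assumptions and will likely need adjusting to the actual layout and release tags of the checked-out repositories:
# Sketch: list commits between two releases that touch MoE-related code in the
# editable vllm-ascend checkout. Tags and paths are assumptions to adjust.
import subprocess

REPO = "/mnt/data3/fyj/vllm-ascend"                     # editable install from the package list
OLD_TAG, NEW_TAG = "v0.9.1", "v0.11.0rc1"               # assumed release tags
PATHS = ["vllm_ascend/ops", "vllm_ascend/distributed"]  # assumed directories

cmd = ["git", "-C", REPO, "log", "--oneline", f"{OLD_TAG}..{NEW_TAG}", "--", *PATHS]
result = subprocess.run(cmd, capture_output=True, text=True, check=False)
print(result.stdout or result.stderr)
Bisecting within that commit range, or simply re-running the workload against a 0.9.1 checkout of both repositories, would confirm whether the operator slowdown is a code regression rather than an environment problem.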
Conclusion
The profiling anomalies observed with the Qwen3-30B MoE model on vLLM 0.11.0rc1 represent a significant challenge for deploying large language models in distributed environments. The unexpected collective communication operations, prolonged execution times for fused operators, and overall degradation in inference performance point to potential issues with the profiling tool, operator implementations, communication strategies, or hardware interactions. Further investigation and validation are necessary to pinpoint the exact root causes and implement effective solutions. Addressing these issues is crucial for realizing the full potential of MoE models and ensuring efficient and scalable inference.
For more information on profiling and debugging distributed deep learning applications, you can refer to resources like the PyTorch Profiler documentation.