LangGraph ToolNode: Interrupt Resume Values Misrouting Bug
Introduction
In LangChain and LangGraph, the ToolNode is a prebuilt component for integrating tools into conversational AI applications. However, a bug has been identified in which interrupt resume values are misrouted between tools running under a ToolNode. The issue can lead to unexpected behavior and incorrect results, especially when multiple tools use interrupts. This article covers the specifics of the bug, its impact, and potential workarounds.
The Issue: Interrupt Resume Values Misrouting
When using the ToolNode from the langgraph-prebuilt library with multiple tools that each call the interrupt() function, a critical issue arises: resume values are sometimes routed to the wrong tool. The value intended for the first tool's interrupt might be delivered to the second tool, and vice versa. This mismatch between expected and actual behavior causes the tools' assertions to fail and can disrupt the flow of the application.
The core problem seems to stem from a race condition within the ToolNode implementation. When multiple tools raise interrupts simultaneously, the system may assign the same ID to both interrupts. This shared ID then becomes the key for resuming the interrupt, leading to confusion when the resume values are passed back. The tools end up receiving each other's resume values, causing the application to behave erratically.
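To make the failure mode concrete, here is a purely hypothetical illustration (the interrupt ID "abc123" is invented for this example) of why a shared ID defeats per-tool routing:

# Hypothetical illustration only: both pending interrupts report the same id.
pending_interrupts = [
    {"id": "abc123", "value": "Interrupt A"},  # raised by tool_a
    {"id": "abc123", "value": "Interrupt B"},  # raised by tool_b
]
# A resume map can hold only one value per id, so {"abc123": "Interrupt A"}
# resolves both interrupts; whichever tool resumes against that entry may
# receive the other tool's value.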
Understanding the Technical Details
To fully grasp the issue, let's break down the key components involved:
- ToolNode: A prebuilt component in LangGraph that allows you to execute multiple tools within a graph structure. It simplifies the process of integrating tools into your conversational flows.
- interrupt(): A function used within a tool to pause its execution and request a resume value. It's a crucial mechanism for handling asynchronous operations or external dependencies.
- Resume Values: The values that are passed back to the interrupted tool to continue its execution. They often contain information needed to proceed with the task.
The bug occurs when the resume values, which are meant to be tool-specific, are mixed up due to the shared interrupt ID. This leads to a situation where a tool receives a resume value that it doesn't expect, causing it to either fail or produce incorrect results.
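For readers unfamiliar with the mechanism, the sketch below shows the basic interrupt/resume cycle using the public LangGraph API (interrupt() and Command(resume=...)); the tool name ask_user and the prompt string are illustrative, not part of the bug report.

from langchain_core.tools import tool
from langgraph.types import Command, interrupt

@tool
def ask_user() -> str:
    """Pause the run and wait for an externally supplied value."""
    answer = interrupt("What value should I use?")  # execution pauses here
    return answer  # on resume, interrupt() returns the supplied value

# The first invocation runs until the interrupt and surfaces it to the caller.
# The caller then resumes the run, keying the value by the pending interrupt's id:
#   graph.invoke(Command(resume={pending_interrupt_id: "the user's answer"}), config=config)

In correct operation, the value keyed by an interrupt's ID is delivered back to the exact interrupt() call that raised it.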
Impact of the Bug
The consequences of this bug can be significant, especially in complex applications that rely heavily on tool interactions. Misrouted interrupt resume values can lead to:
- Incorrect Results: Tools may produce outputs that are inconsistent with the intended logic, leading to inaccurate or misleading information.
- Application Crashes: In some cases, the misrouting can cause tools to enter an invalid state, resulting in application crashes or unexpected termination.
- Debugging Challenges: Because the bug stems from a race condition, it is difficult to reproduce and debug; the issue may not occur on every run.
- Reduced Reliability: The overall reliability of the application is compromised, as the bug can introduce unpredictable behavior.
Reproducing the Issue
The provided example code demonstrates how to reproduce the bug. It sets up a LangGraph with two tools, tool_a and tool_b, both of which raise interrupts. The code then attempts to resume these interrupts and checks if the resume values are correctly routed. The bug typically manifests as assertion failures, indicating that the tools received the wrong interrupt values.
The key to triggering the bug is to run the code multiple times, as the race condition may not occur on every execution. The provided test_main function runs the parallel_interrupts function 100 times to increase the likelihood of triggering the issue.
Minimal Reproducible Example (MRE)
To better illustrate the issue, let's examine the provided Minimal Reproducible Example (MRE) code. This code snippet effectively demonstrates the bug and allows developers to reproduce it in their own environments.
import asyncio
import logging
import math
import random
from typing import Annotated, TypedDict

from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.tools import tool
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import END, START, StateGraph
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode
from langgraph.types import Command, interrupt

logger = logging.getLogger(__name__)


@tool
async def tool_a():
    """The tool raises Interrupt A and expects that exact value in order to continue."""
    expected = "Interrupt A"
    await delay()
    result = interrupt(expected)
    assert expected == result, f"Expected {expected}, got {result}"
    return result


@tool
async def tool_b():
    """The tool raises Interrupt B and expects that exact value in order to continue."""
    expected = "Interrupt B"
    await delay()
    result = interrupt(expected)
    assert expected == result, f"Expected {expected}, got {result}"
    return result


async def parallel_interrupts():
    """Resubmit interrupt values to confirm they are channeled back to the right tools."""
    graph = make_graph()
    config = {"configurable": {"thread_id": "test-thread-1"}}

    # First invocation triggers both tools in parallel.
    first_result = await graph.ainvoke(
        {"messages": [HumanMessage(content="Tool calls")]}, config=config, stream_mode="debug"
    )

    # Resume the first pending interrupt with its own value.
    first_interrupt = first_result[-1]["payload"]["interrupts"][0]
    first_interrupt_id = first_interrupt["id"]
    first_interrupt_value = first_interrupt["value"]
    logger.info(f"Resuming first interrupt with id {first_interrupt_id} and value {first_interrupt_value}")
    second_result = await graph.ainvoke(
        Command(resume={first_interrupt_id: first_interrupt_value}), config=config, stream_mode="debug"
    )

    # Resume the second pending interrupt with its own value.
    second_interrupt = second_result[-1]["payload"]["interrupts"][0]
    second_interrupt_id = second_interrupt["id"]
    second_interrupt_value = second_interrupt["value"]
    logger.info(f"Resuming second interrupt with id {second_interrupt_id} and value {second_interrupt_value}")
    _ = await graph.ainvoke(
        Command(resume={second_interrupt_id: second_interrupt_value}), config=config, stream_mode="debug"
    )

    assert first_interrupt_id == second_interrupt_id  # never fails: the two interrupts share one ID


async def test_main():
    num_iterations = 100  # repeat multiple times to trigger the race condition
    for i in range(num_iterations):
        await parallel_interrupts()


async def delay():
    """Simulate async work with random delays and CPU-bound bursts to force the event loop to switch coroutines."""
    await asyncio.sleep(random.uniform(0.0, 0.05))
    for _ in range(random.randint(3, 10)):
        x = 0
        for i in range(1000):
            x += math.sin(i + i**2)
        await asyncio.sleep(0)
    await asyncio.sleep(random.uniform(0.0, 0.05))


def agent_node(*args, **kwargs):
    # A stub agent that always requests both tools in the same turn.
    return {
        "messages": [
            AIMessage(
                content="Calling both tools in parallel",
                tool_calls=[
                    {"name": "tool_a", "args": {}, "id": "call_a", "type": "tool_call"},
                    {"name": "tool_b", "args": {}, "id": "call_b", "type": "tool_call"},
                ],
            )
        ]
    }


class State(TypedDict):
    messages: Annotated[list, add_messages]


def should_continue(state: State):
    last_message = state["messages"][-1]
    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        return "tools"
    return END


def make_graph():
    graph = StateGraph(State)
    graph.add_node("agent", agent_node)
    graph.add_node("tools", ToolNode([tool_a, tool_b], handle_tool_errors=False))
    graph.add_edge(START, "agent")
    graph.add_conditional_edges("agent", should_continue, ["tools", END])
    graph.add_edge("tools", END)
    memory = MemorySaver()  # a checkpointer is required for interrupt support
    app = graph.compile(checkpointer=memory)
    return app


if __name__ == "__main__":
    asyncio.run(test_main())
Code Breakdown
- Imports: The code begins by importing the necessary libraries, including asyncio, logging, langchain_core, langgraph, and others.
- Tool Definitions: Two tools, tool_a and tool_b, are defined using the @tool decorator. Each tool raises an interrupt with a specific value ("Interrupt A" and "Interrupt B", respectively) and asserts that the resumed value matches the expected value.
- parallel_interrupts Function: This function creates a LangGraph with the two tools and then invokes it. It captures the interrupt IDs and values and attempts to resume the interrupts. The key part is the assertion assert first_interrupt_id == second_interrupt_id, which highlights the issue of shared interrupt IDs.
- test_main Function: This function runs the parallel_interrupts function multiple times to increase the chances of triggering the race condition.
- Graph Creation: The make_graph function defines the LangGraph structure, including the agent node, the tool node, and the edges connecting them.
- Asynchronous Delay: The delay function simulates asynchronous work with random delays and CPU-bound tasks, which helps to trigger the race condition.
Expected Behavior vs. Actual Behavior
The expected behavior is that each tool should receive its own interrupt value back when resumed. However, the actual behavior is that the interrupt values are sometimes misrouted, leading to assertion failures in the tools.
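Based on the assertion message in the MRE's tools, a misrouted resume surfaces as an error along these lines (the exact traceback varies from run to run):

AssertionError: Expected Interrupt A, got Interrupt B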
Running the MRE
To run the MRE, you need to have the necessary dependencies installed. These include langchain-core, langchain, langsmith, langchain-google-vertexai, langchain-text-splitters, langgraph, langgraph-prebuilt, pydantic, pytest, and typing-extensions. Once the dependencies are installed, you can simply execute the Python script, and it will run the tests and demonstrate the bug.
Workarounds and Solutions
As of the current version of LangGraph, there isn't a definitive workaround for this issue. However, here are a few potential strategies that might help mitigate the problem:
- Avoid Parallel Interrupts: If possible, structure your application so that multiple interrupts are never raised simultaneously. This might involve redesigning the flow or using alternative mechanisms for handling asynchronous operations; a sketch of this approach follows this list.
- Implement Custom Interrupt Handling: Instead of relying on the built-in interrupt mechanism, you could implement your own interrupt handling logic. This would give you more control over the interrupt IDs and resume value routing.
- Use a Single ToolNode: If feasible, consolidate your tools into a single ToolNode. This might reduce the chances of the race condition occurring.
- Contribute to LangGraph: The best long-term solution is to contribute to the LangGraph project by submitting a bug fix. This would benefit the entire community and ensure the stability of the library.
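As a sketch of the first strategy, the following variant of the MRE serializes the tool calls so that at most one interrupt is pending at any moment. It reuses the MRE's State, should_continue, tools, and imports; serial_agent_node and make_serial_graph are illustrative names, and it assumes ToolNode populates the name field on the ToolMessages it emits.

# A sketch, not an official fix: emit one tool call per agent turn and loop
# back through "agent", so only one interrupt can be pending at a time.
def serial_agent_node(state: State):
    already_called = {m.name for m in state["messages"] if getattr(m, "type", "") == "tool"}
    for name, call_id in [("tool_a", "call_a"), ("tool_b", "call_b")]:
        if name not in already_called:
            return {
                "messages": [
                    AIMessage(
                        content=f"Calling {name}",
                        tool_calls=[{"name": name, "args": {}, "id": call_id, "type": "tool_call"}],
                    )
                ]
            }
    return {"messages": [AIMessage(content="All tools done")]}  # no tool_calls -> END


def make_serial_graph():
    graph = StateGraph(State)
    graph.add_node("agent", serial_agent_node)
    graph.add_node("tools", ToolNode([tool_a, tool_b], handle_tool_errors=False))
    graph.add_edge(START, "agent")
    graph.add_conditional_edges("agent", should_continue, ["tools", END])
    graph.add_edge("tools", "agent")  # loop back instead of ending after the tools
    return graph.compile(checkpointer=MemorySaver())

The trade-off is latency: the tools no longer run concurrently, but each resume value can only reach the single interrupt that is outstanding.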
Contributing to LangGraph
If you're interested in contributing to LangGraph, you can follow these steps:
- Fork the LangGraph repository on GitHub.
- Create a new branch for your bug fix or feature.
- Implement the fix or feature in your branch.
- Write tests to ensure the fix or feature works correctly (see the regression-test sketch after this list).
- Submit a pull request to the main LangGraph repository.
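If you want to validate a candidate fix, a regression test can simply hammer the MRE. Below is a hedged sketch that assumes pytest-asyncio (or an equivalent async test runner) is configured; parallel_interrupts is the function from the MRE above.

import pytest

@pytest.mark.asyncio
async def test_parallel_interrupt_routing():
    # Repeat many times: the race condition does not fire on every run.
    for _ in range(100):
        await parallel_interrupts()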
System Information
The issue has been reported on the following system:
- OS: Darwin
- OS Version: Darwin Kernel Version 24.6.0: Wed Oct 15 21:12:05 PDT 2025; root:xnu-11417.140.69.703.14~1/RELEASE_ARM64_T6030
- Python Version: 3.11.12 (main, Apr 8 2025, 14:15:29) [Clang 16.0.0 (clang-1600.0.26.6)]
Package Information
The following packages are being used:
- langchain_core: 1.1.0
- langchain: 1.1.0
- langsmith: 0.4.12
- langchain_google_vertexai: 2.1.2
- langchain_text_splitters: 0.3.9
- langgraph: 1.0.4
- langgraph-prebuilt: 1.0.5
- pydantic: 2.11.7
- pytest: 8.3.3
- typing-extensions: 4.14.1
Conclusion
The interrupt resume values misrouting bug in LangGraph's ToolNode is a significant issue that can lead to unpredictable behavior and incorrect results. Understanding the root cause of the bug and its impact is crucial for developing robust and reliable conversational AI applications. While there isn't a definitive workaround yet, the strategies outlined in this article can help mitigate the problem. The best long-term solution is to contribute to the LangGraph project and help fix the bug.
For more information on LangChain and LangGraph, see the official LangChain documentation, which provides further resources for understanding and using these tools.