Prefetch & Host Allocations: Page Migration Issues

by Alex Johnson

Introduction

In memory management, the interplay between prefetching and allocation strategy can lead to unexpected challenges. This article delves into a specific issue where prefetching shared allocations to the GPU leaves the host's page mappings without permissions, resulting in segmentation faults and program crashes. We will explore the root cause, a workaround based on host allocations, and the trade-offs involved. Understanding these intricacies is crucial for developers aiming to optimize memory usage while keeping their applications stable. The issue arises specifically when managed memory is backed by shared allocations, such as zeMemAllocShared or clSharedMemAllocINTEL, in environments where a signal handler is installed. Let's dive into the specifics.

The Problem: Segfaults After GPU Kernel Execution

When working with managed memory using functions like zeMemAllocShared or clSharedMemAllocINTEL, developers may encounter segmentation faults (segfaults) after the GPU kernel has executed. This issue manifests particularly when a SIGSEGV handler is installed, such as in debugging environments or when using testing frameworks like Catch2. The core issue stems from how Intel's drivers handle shared allocations and page migrations between the host and the GPU.

Root Cause: Page Migration and Permissions

The root cause of these segfaults lies in the Intel driver's use of page migration for shared allocations. When the GPU accesses the memory, the host page mappings are left in a ---s state, i.e. a shared mapping with no read, write, or execute permission, as reported by /proc/<pid>/maps. This state is a consequence of the memory pages being migrated to the GPU for processing. The problem arises when host code attempts to access this memory after the GPU has processed it.
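
For illustration, a grep of /proc/<pid>/maps for such an allocation might show something like the following (the address and fields here are hypothetical; the exact backing object varies by driver):

7f3a80000000-7f3a80008000 ---s 00000000 00:0e 12345      (driver-backed mapping)

The ---s permission field is exactly what the reproducer's print_mapping helper, shown later in this article, prints at each step.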

To illustrate this, consider two scenarios:

  1. Without a Signal Handler: If there is no signal handler installed, the kernel handles the page fault. The driver recognizes the fault, migrates the data back to the host, and the access succeeds. This process is transparent to the application, although it may introduce a performance overhead due to the data migration.
  2. With a Signal Handler: When a signal handler is installed (e.g., by Catch2 or a debugger), the signal handler intercepts the SIGSEGV signal before the driver can handle it. This interception prevents the driver from migrating the data back to the host, leading to a crash. The application terminates because the memory access violates the permissions set by the operating system.

This behavior highlights a critical challenge in memory management: the synchronization between host and device memory spaces. When memory is shared between the host and the GPU, ensuring that both have the correct permissions and access to the data at the right time is essential.
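
A related mitigation, when the offending handler comes from a test framework, is to disable the framework's signal handling. Catch2, for instance, can be built without its POSIX signal handlers; this is a blunt instrument, since it also disables Catch2's crash reporting. With header-only Catch2 v2 this is a compile-time define, as sketched below; Catch2 v3 exposes the same CATCH_CONFIG_NO_POSIX_SIGNALS option when building the Catch2 library itself.

// Catch2 v2: opt out of the framework's SIGSEGV/SIGBUS handlers so the
// driver's fault handling can run transparently.
#define CATCH_CONFIG_NO_POSIX_SIGNALS
#include <catch2/catch.hpp>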

Workaround: Using Host Allocations

To circumvent the issues caused by page migration, a practical workaround is to use host allocations instead of shared allocations for managed memory. This approach involves using functions like zeMemAllocHost or clHostMemAllocINTEL. By allocating memory directly in the host's address space, the memory remains accessible from both the host and the device without the need for migration.
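
For the OpenCL USM path, the change is essentially a one-call swap. The following is a minimal sketch, assuming the cl_intel_unified_shared_memory extension is available and using the same extension-loading pattern as the reproducer later in this article:

typedef void* (*clHostMemAllocINTEL_fn)(cl_context,
    const cl_mem_properties_intel*, size_t, cl_uint, cl_int*);

clHostMemAllocINTEL_fn clHostMemAllocINTEL = (clHostMemAllocINTEL_fn)
    clGetExtensionFunctionAddressForPlatform(platform, "clHostMemAllocINTEL");

// Host allocation: accessible from host and device alike, never migrated,
// so the host mapping keeps its read/write permissions after a kernel runs.
void* input = clHostMemAllocINTEL(context, NULL, size, 0, &err);

Host allocations are freed with clMemFreeINTEL, just like shared ones.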

Benefits of Host Allocations

  1. Eliminates Segfaults: Host allocations ensure that the memory remains accessible from both the host and the device without migration issues. This prevents the ---s state and the subsequent segfaults that occur when the host tries to access migrated memory.
  2. Simplifies Debugging: By avoiding the complexities of page migration, host allocations can simplify debugging. Developers can reliably access memory from the host without the risk of triggering a signal handler and crashing the application.
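
The same swap applies on the Level Zero side: replace zeMemAllocShared with zeMemAllocHost. A minimal sketch follows (error handling elided; the context and size variables are assumed to exist):

#include <level_zero/ze_api.h>

ze_host_mem_alloc_desc_t hostDesc = {
    .stype = ZE_STRUCTURE_TYPE_HOST_MEM_ALLOC_DESC,
    .pNext = NULL,
    .flags = 0
};
void* ptr = NULL;
// The allocation lives in system memory; the device reads it over the bus
// instead of migrating pages, so the host mapping stays valid throughout.
zeMemAllocHost(context, &hostDesc, size, 0 /*alignment*/, &ptr);
/* ... use ptr from host and device ... */
zeMemFree(context, ptr);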

Trade-offs: Performance Considerations

While host allocations provide a robust solution to the segfault issue, they come with a performance trade-off. When data stays in system memory instead of migrating to device memory, there is a potential performance penalty. Accessing system memory is generally slower than accessing memory on the GPU, which can impact the overall execution time of the application.

  1. Performance Penalty: Since the data resides in system memory, accessing it from the GPU involves transferring data over the PCI Express bus, which is slower than accessing local GPU memory.
  2. Memory Locality: Data locality is crucial for performance. When data is migrated to the GPU, computations can be performed with lower latency. Keeping data in system memory reduces this locality, potentially leading to increased execution times.

Balancing Performance and Stability

Developers must weigh the trade-offs between performance and stability when choosing between shared and host allocations. In scenarios where debugging is critical, or when running tests under frameworks like Catch2, using host allocations may be preferable to avoid crashes. However, in production environments where performance is paramount, shared allocations with careful memory management may be more appropriate.

Affected Tests and Reproducers

This page-migration issue and the resulting segfaults can affect a variety of tests, particularly those that use hipMallocManaged followed by GPU kernel execution and host access, as sketched below. These tests are especially vulnerable when run under frameworks like Catch2, which install a signal handler that intercepts SIGSEGV.
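
For reference, here is a minimal sketch of the affected pattern in HIP (the kernel and variable names are illustrative, error checking elided):

#include <hip/hip_runtime.h>

__global__ void square(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= data[i];
}

int main() {
    const int N = 1024;
    int* data = NULL;
    hipMallocManaged((void**)&data, N * sizeof(int)); // managed (shared) allocation
    for (int i = 0; i < N; i++) data[i] = i;          // host write: fine
    hipMemPrefetchAsync(data, N * sizeof(int), 0, 0); // migrate pages to device 0
    hipLaunchKernelGGL(square, dim3(N / 256), dim3(256), 0, 0, data, N);
    hipDeviceSynchronize();
    volatile int first = data[0]; // host read: faults here if a SIGSEGV handler
                                  // keeps the driver from migrating pages back
    (void)first;
    hipFree(data);
    return 0;
}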

Reproducer Code Example

The following code example demonstrates how the issue can be reproduced using OpenCL. This reproducer allocates memory using either shared memory allocation (clSharedMemAllocINTEL) or SVM allocation (clSVMAlloc), initializes it on the host, prefetches it to the device, runs a simple kernel, and then attempts to read the memory from the host. The key part of the reproducer is the setup of a segfault handler, which will catch the SIGSEGV signal if the memory access fails.

#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>
#include <CL/cl_ext.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <signal.h>
#include <setjmp.h>

static sigjmp_buf jump_buffer;
static volatile sig_atomic_t got_segfault = 0;

// siglongjmp (paired with sigsetjmp(..., 1)) restores the signal mask,
// so SIGSEGV is not left blocked after we jump out of the handler.
void segfault_handler(int sig) {
    (void)sig;
    got_segfault = 1;
    siglongjmp(jump_buffer, 1);
}

#define CHECK_CL(err, msg) do { \
    if (err != CL_SUCCESS) { \
        fprintf(stderr, "OpenCL Error %d at %s:%d - %s\n", err, __FILE__, __LINE__, msg); \
        exit(1); \
    } \
} while(0)

// Minimal kernel that just reads from input and writes to output
const char* kernelSource = R"(
__kernel void copy(__global int* output, __global const int* input, int N) {
    int gid = get_global_id(0);
    if (gid < N) {
        output[gid] = input[gid];
    }
}
)";

typedef void* (*clSharedMemAllocINTEL_fn)(cl_context, cl_device_id,
    const cl_mem_properties_intel*, size_t, cl_uint, cl_int*);
typedef cl_int (*clMemFreeINTEL_fn)(cl_context, void*);

void print_mapping(void* ptr) {
    char cmd[256];
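    // Mask the pointer down to its 4 KiB page base, then grep for that
    // address in /proc/<pid>/maps to show the mapping's permission bits.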
    snprintf(cmd, sizeof(cmd),
        "grep %lx /proc/%d/maps 2>/dev/null | head -1 || echo 'not found'",
        ((unsigned long)ptr) >> 12 << 12, getpid());
    system(cmd);
}

int main(int argc, char** argv) {
    int use_svm = (argc > 1 && strcmp(argv[1], "svm") == 0);

    printf("Intel OpenCL USM Shared Memory Bug Reproducer\n");
    printf("==============================================\n");
    printf("Allocation method: %s\n\n", use_svm ? "SVM (clSVMAlloc)" : "USM (clSharedMemAllocINTEL)");

    cl_int err;
    cl_platform_id platform = NULL;
    cl_device_id device = NULL;

    // Find Intel GPU
    cl_uint numPlatforms;
    clGetPlatformIDs(0, NULL, &numPlatforms);
    cl_platform_id* platforms = (cl_platform_id*)malloc(numPlatforms * sizeof(cl_platform_id));
    clGetPlatformIDs(numPlatforms, platforms, NULL);

    for (cl_uint i = 0; i < numPlatforms; i++) {
        char platName[256];
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(platName), platName, NULL);

        cl_uint numDevices;
        err = clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_GPU, 0, NULL, &numDevices);
        if (err == CL_SUCCESS && numDevices > 0) {
            platform = platforms[i];
            clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
            printf("Platform: %s\n", platName);
            break;
        }
    }
    free(platforms);

    if (!device) {
        fprintf(stderr, "No GPU found\n");
        return 1;
    }

    char deviceName[256], driverVersion[256];
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(deviceName), deviceName, NULL);
    clGetDeviceInfo(device, CL_DRIVER_VERSION, sizeof(driverVersion), driverVersion, NULL);
    printf("Device: %s\n", deviceName);
    printf("Driver: %s\n\n", driverVersion);

    // Check for USM extension
    char extensions[8192];
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof(extensions), extensions, NULL);
    if (!use_svm && !strstr(extensions, "cl_intel_unified_shared_memory")) {
        fprintf(stderr, "USM extension not supported\n");
        return 1;
    }

    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    CHECK_CL(err, "clCreateContext");

    cl_command_queue queue = clCreateCommandQueueWithProperties(context, device, NULL, &err);
    CHECK_CL(err, "clCreateCommandQueueWithProperties");

    // Build kernel
    cl_program program = clCreateProgramWithSource(context, 1, &kernelSource, NULL, &err);
    CHECK_CL(err, "clCreateProgramWithSource");
    err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    if (err != CL_SUCCESS) {
        char log[4096];
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, sizeof(log), log, NULL);
        fprintf(stderr, "Build error: %s\n", log);
        return 1;
    }
    cl_kernel kernel = clCreateKernel(program, "copy", &err);
    CHECK_CL(err, "clCreateKernel");

    const int N = 8192;  // Larger allocation to ensure issue manifests
    const size_t size = N * sizeof(int);
    void* input = NULL;
    void* output = NULL;

    clSharedMemAllocINTEL_fn clSharedMemAllocINTEL = NULL;
    clMemFreeINTEL_fn clMemFreeINTEL = NULL;

    // Allocate memory
    printf("Step 1: Allocate memory\n");
    if (use_svm) {
        input = clSVMAlloc(context, CL_MEM_READ_WRITE, size, 0);
        output = clSVMAlloc(context, CL_MEM_READ_WRITE, size, 0);
        if (!input || !output) {
            fprintf(stderr, "SVM allocation failed\n");
            return 1;
        }
    } else {
        clSharedMemAllocINTEL = (clSharedMemAllocINTEL_fn)
            clGetExtensionFunctionAddressForPlatform(platform, "clSharedMemAllocINTEL");
        clMemFreeINTEL = (clMemFreeINTEL_fn)
            clGetExtensionFunctionAddressForPlatform(platform, "clMemFreeINTEL");

        input = clSharedMemAllocINTEL(context, device, NULL, size, 0, &err);
        CHECK_CL(err, "clSharedMemAllocINTEL (input)");
        output = clSharedMemAllocINTEL(context, device, NULL, size, 0, &err);
        CHECK_CL(err, "clSharedMemAllocINTEL (output)");
    }
    printf("   input=%p output=%p\n", input, output);
    printf("   Memory mapping: "); print_mapping(input);

    // Initialize from host - direct access for USM, mapped access for SVM
    printf("\nStep 2: Initialize memory from host\n");
    int* in = (int*)input;
    int* out = (int*)output;

    if (use_svm) {
        // SVM coarse-grain requires map/unmap
        clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, input, size, 0, NULL, NULL);
        clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, output, size, 0, NULL, NULL);
    }
    // USM shared memory can be accessed directly (that's the point of USM)
    for (int i = 0; i < N; i++) {
        in[i] = i;
        out[i] = 0;
    }
    if (use_svm) {
        clEnqueueSVMUnmap(queue, input, 0, NULL, NULL);
        clEnqueueSVMUnmap(queue, output, 0, NULL, NULL);
        clFinish(queue);
    }
    printf("   SUCCESS: Wrote %d elements, in[0]=%d\n", N, in[0]);
    printf("   Memory mapping: "); print_mapping(input);

    // Prefetch to device (like hipMemPrefetchAsync does)
    printf("\nStep 3: Prefetch to device\n");
    {
        const void* ptrs[] = {input};
        const size_t sizes[] = {size};
        cl_event ev;
        err = clEnqueueSVMMigrateMem(queue, 1, ptrs, sizes, 0 /*to device*/, 0, NULL, &ev);
        CHECK_CL(err, "clEnqueueSVMMigrateMem");
        clWaitForEvents(1, &ev);
        clReleaseEvent(ev);
        clFinish(queue);
    }
    printf("   Prefetch completed\n");
    printf("   Memory mapping: "); print_mapping(input);

    // Set kernel arguments and run
    printf("\nStep 4: Run kernel\n");
    if (use_svm) {
        void* ptrs[] = {input, output};
        clSetKernelExecInfo(kernel, CL_KERNEL_EXEC_INFO_SVM_PTRS, sizeof(ptrs), ptrs);
    } else {
        void* ptrs[] = {input, output};
        clSetKernelExecInfo(kernel, 0x4203 /*CL_KERNEL_EXEC_INFO_USM_PTRS_INTEL*/, sizeof(ptrs), ptrs);
        // Also set indirect access flag like chipStar does
        cl_bool indirect = CL_TRUE;
        clSetKernelExecInfo(kernel, 0x4202 /*CL_KERNEL_EXEC_INFO_INDIRECT_SHARED_ACCESS_INTEL*/,
                           sizeof(cl_bool), &indirect);
    }
    clSetKernelArgSVMPointer(kernel, 0, output);
    clSetKernelArgSVMPointer(kernel, 1, input);
    clSetKernelArg(kernel, 2, sizeof(int), &N);

    size_t globalSize = N;
    cl_event ev;
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, &ev);
    CHECK_CL(err, "clEnqueueNDRangeKernel");
    clWaitForEvents(1, &ev);
    clReleaseEvent(ev);
    clFinish(queue);
    printf("   Kernel completed\n");
    printf("   Memory mapping: "); print_mapping(input);

    // Try to read from host - NO SVM MAP, just direct access like HIP managed memory
    printf("\nStep 5: Read memory from host (direct access, no clEnqueueSVMMap)\n");
    printf("   Memory mapping: "); print_mapping(input);

    // Set up segfault handler
    struct sigaction sa, old_sa;
    sa.sa_handler = segfault_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGSEGV, &sa, &old_sa);

    if (sigsetjmp(jump_buffer, 1) == 0) {
        printf("   Attempting to read input[0]... ");
        fflush(stdout);
        volatile int val = in[0];
        printf("SUCCESS: %d\n", val);
    } else {
        printf("SEGFAULT!\n");
    }

    // Restore handler
    sigaction(SIGSEGV, &old_sa, NULL);

    if (got_segfault) {
        printf("\n*** BUG REPRODUCED: USM memory inaccessible from host after kernel ***\n");
        return 1;
    }

    // Cleanup
    if (use_svm) {
        clSVMFree(context, input);
        clSVMFree(context, output);
    } else {
        clMemFreeINTEL(context, input);
        clMemFreeINTEL(context, output);
    }
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);

    printf("\nTEST PASSED\n");
    return 0;
}

This code demonstrates the issue by:

  1. Allocating shared memory using either clSharedMemAllocINTEL or clSVMAlloc.
  2. Initializing the memory on the host.
  3. Prefetching the memory to the device.
  4. Running a simple kernel that copies data.
  5. Installing a signal handler to catch SIGSEGV.
  6. Attempting to read the memory from the host without mapping it back.

When the segfault handler is triggered, the code prints a message indicating that the bug has been reproduced. This reproducer effectively demonstrates the issue of USM memory becoming inaccessible from the host after kernel execution.
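
To try it yourself, build the reproducer against an OpenCL ICD loader and run it on an Intel GPU (the file name here is hypothetical):

gcc -O2 -o usm_repro usm_repro.c -lOpenCL
./usm_repro        # USM path (clSharedMemAllocINTEL)
./usm_repro svm    # SVM path (clSVMAlloc)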

Detailed Explanation of the Reproducer

To fully understand the issue, let's break down the reproducer code step by step.

Step 1: Memory Allocation

The reproducer begins by allocating memory using either clSharedMemAllocINTEL (for USM) or clSVMAlloc (for SVM). The choice between these methods is controlled by the command-line argument. If the program is run with the svm argument, it uses SVM; otherwise, it uses USM.

if (use_svm) {
    input = clSVMAlloc(context, CL_MEM_READ_WRITE, size, 0);
    output = clSVMAlloc(context, CL_MEM_READ_WRITE, size, 0);
    if (!input || !output) {
        fprintf(stderr, "SVM allocation failed\n");
        return 1;
    }
} else {
    clSharedMemAllocINTEL = (clSharedMemAllocINTEL_fn)
        clGetExtensionFunctionAddressForPlatform(platform, "clSharedMemAllocINTEL");
    clMemFreeINTEL = (clMemFreeINTEL_fn)
        clGetExtensionFunctionAddressForPlatform(platform, "clMemFreeINTEL");

    input = clSharedMemAllocINTEL(context, device, NULL, size, 0, &err);
    CHECK_CL(err, "clSharedMemAllocINTEL (input)");
    output = clSharedMemAllocINTEL(context, device, NULL, size, 0, &err);
    CHECK_CL(err, "clSharedMemAllocINTEL (output)");
}

Step 2: Memory Initialization

Next, the code initializes the allocated memory on the host. For USM, the memory can be accessed directly, while SVM requires mapping and unmapping the memory regions.

if (use_svm) {
    clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, input, size, 0, NULL, NULL);
    clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, output, size, 0, NULL, NULL);
}
for (int i = 0; i < N; i++) {
    in[i] = i;
    out[i] = 0;
}
if (use_svm) {
    clEnqueueSVMUnmap(queue, input, 0, NULL, NULL);
    clEnqueueSVMUnmap(queue, output, 0, NULL, NULL);
    clFinish(queue);
}

Step 3: Memory Prefetch

The code then prefetches the memory to the device, simulating the behavior of hipMemPrefetchAsync. This step is crucial for triggering the page migration issue.

{
    const void* ptrs[] = {input};
    const size_t sizes[] = {size};
    cl_event ev;
    err = clEnqueueSVMMigrateMem(queue, 1, ptrs, sizes, 0 /*to device*/, 0, NULL, &ev);
    CHECK_CL(err, "clEnqueueSVMMigrateMem");
    clWaitForEvents(1, &ev);
    clReleaseEvent(ev);
    clFinish(queue);
}

Step 4: Kernel Execution

A simple copy kernel is executed to simulate GPU processing of the memory. The kernel copies data from the input buffer to the output buffer.

clSetKernelArgSVMPointer(kernel, 0, output);
clSetKernelArgSVMPointer(kernel, 1, input);
clSetKernelArg(kernel, 2, sizeof(int), &N);

size_t globalSize = N;
cl_event ev;
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, &ev);
CHECK_CL(err, "clEnqueueNDRangeKernel");
clWaitForEvents(1, &ev);
clReleaseEvent(ev);
clFinish(queue);

Step 5: Host Access and Segfault Handling

Finally, the code attempts to read the memory from the host without mapping it back (in the case of SVM). A segfault handler is set up to catch the SIGSEGV signal if the memory access fails. This is where the issue manifests: the host memory is inaccessible due to the page migration, and the signal handler is triggered.

struct sigaction sa, old_sa;
sa.sa_handler = segfault_handler;
sigemptyset(&sa.sa_mask);
sa.sa_flags = 0;
sigaction(SIGSEGV, &sa, &old_sa);

if (sigsetjmp(jump_buffer, 1) == 0) {
    printf("   Attempting to read input[0]... ");
    fflush(stdout);
    volatile int val = in[0];
    printf("SUCCESS: %d\n", val);
} else {
    printf("SEGFAULT!\n");
}

sigaction(SIGSEGV, &old_sa, NULL);

Conclusion

Prefetching shared allocations to the device can leave host page mappings without permissions, which creates significant memory management challenges in heterogeneous computing environments, especially when a signal handler intercepts the resulting faults before the driver can migrate the pages back. By understanding the root cause and applying the workaround described here, developers can mitigate the risk of crashes and keep their applications stable. Host allocations provide a reliable fix, but it is essential to weigh the performance trade-offs and choose the memory allocation strategy that best fits the application's requirements.

For further reading on memory management in OpenCL and related topics, consider exploring resources such as the Khronos Group's OpenCL documentation.