Adventures in OpenCL Part 3: Constant Memory Structs

Today we make a brief stop on our journey to examine a technique which has proven useful but not straightforward; passing in a structure as a parameter to an OpenCL kernel as __constant memory. Why would you want to do this? Allow me to spin you a tale of 8 kernels which share an assortment of around 30 parameters many of which are either configurable by the user or can change at runtime based on user input. Throw in the fact that a complex program like this requires plenty of debugging and who knows which parameters will end up at the finish line and which will be left by the wayside. So rather than specify an assortment of variables to pass to each kernel, we define a structure to hold our colorful collection of floats and ints and just throw the whole bundle at each kernel. This is of course convenient for us programmers, at least once structs in OpenCL make some sense to us, and as a bonus it turns out to be a good idea in terms of memory usage as well. So lets get into it!

Grab the code

You can follow along in C++ or Python

This is just a slight adaption of my Part 1 (C++, Python) tutorial code, so check those out if you need help getting started.

So lets take a look at what we want to achieve in OpenCL:

typedef struct Params
{
    float A;
    float B;
    int C;
} Params;

__kernel void part3(__global const float *a,
                  __global const float *b,
                  __global float *c,
                  __constant struct Params* test
                  )
{
    int gid = get_global_id(0);
    c[gid] = test->A * a[gid] + test->B * b[gid] + test->C;
}

You can see on line 16 that all we are doing here is a simple sum, but multiplying each number by a (float) parameter and adding an (int) parameter. You may already have noticed something weird going on in line 12, why is the Params struct a pointer? Excellent question. Even though we only want one structure, we still need to pass it in as a buffer to appease OpenCL (NOTE: This is a problem we’ve come across, we’d love to see sample code which allowed straightforward passing of a struct). So it’s just going to pass a bunch of binary data to the kernel, and then the kernel will have to structure the data when it gets there. And as far as I can tell the only way to get a buffer into a kernel is by passing it in with the pointer syntax you see on line 12 here.
Passing a struct as a buffer is not much different from passing a regular array in C++:

Params params;
params.A = .5f;
params.B = 10.0f;
params.C = 3;
cl_params = cl::Buffer(context, CL_MEM_READ_ONLY, sizeof(Params), NULL, &err);
err = queue.enqueueWriteBuffer(cl_params, CL_TRUE, 0, sizeof(Params), &params, NULL, &event);

In Python (more on the struct module)

params = struct.pack('ffffi', .5, 10., 0., 0., 3)
params_buf = cl.Buffer(ctx, mf.READ_ONLY, len(params))
cl.enqueue_write_buffer(queue, params_buf, params).wait()

And that’s all it takes to have access to our structure in OpenCL!

Update: Please see David Garcias comment below for a correction of this section. A quick experiment showed that it’s not necessary to have padding in this case. I still find this a confusing issue and it would be nice to see some self-contained example code showing cases where alignment breaks.

So about that float padding[2]; in the struct definition. This is because of memory alignment in OpenCL. The best explanation I’ve seen so far is by AndreasStahl which I will briefly summarize in relation to the struct above.
When interpreting a struct, OpenCL accesses the memory in blocks of 16 bytes, which is the same as 4 floats (each 4 bytes). So in our example if we did not have the padding, we would not be able to access our int because opencl would have interpreted it as the 3rd float out of the first 16bytes. This can get even more complicated if you have an array of structs, because then the size of you’re struct will need to be a multiple of 16, as explained in the linked forum post.

So lets talk a little bit about __constant memory. On a GPU this memory is read-only, cached and usually around 64kb per multiprocessor (you can query this, for an example look to the CLInfo example in the AMD SDK and oclDeviceQuery in the NVIDIA SDK). It is closer to the processor than __global and much faster to access, while slightly slower than __local memory, so if you are passing in a ton of them you may be eating into the precious few (48) kilobytes you have, so moving them to __constant memory can free up a little space as well.

Additionally, at least on some implementations there seems to be an arbitrary limit of 9 constant (non-buffer) parameters, which using a struct will help you avoid. (NOTE: as David points out below, the limit is not arbitrary. It can be queried through clGetDeviceInfo() and CL_DEVICE_MAX_CONSTANT_ARGS)

What about other people’s approaches? What kind of trouble have you run into with structs? kernel parameters?

9 thoughts on “Adventures in OpenCL Part 3: Constant Memory Structs

  1. Bunt

    Hey enja great tutorials. Is there any reason why you switched from Python back to C++ though, did you find any particular drawbacks in working with Python and PyOpenCL? Cheers, Bunt, BVI

  2. enj Post author

    Hey Bunt, I didn’t switch, this one has the same code in both languages. My main project is still in C++ so I will have to focus on it. I still prefer PyOpenCL for development though, and I’ve been playing in it recently. From now on I will probably be doing my tutorials in both languages.

  3. David Garcia

    I believe there are some inaccuracies in the article.

    Even though we only want one structure, we still need to pass it in as a buffer to appease OpenCL.

    I don’t think there’s anything in the OpenCL spec preventing you to pass a struct as a kernel argument. For instance, section 5.7.2 lists how to pass different types of arguments to clSetKernelArg. After listing all built-in types and pointers, it says “For all other kernel arguments, the arg_value entry must be a pointer to the actual data to be used as argument value.” Additionally, section 6.8-p reads “Arguments to __kernel functions that are declared to be a struct do not allow OpenCL objects to be passed as elements of the struct”, which implies that it’s okay to pass structs as kernel arguments.

    When interpreting a struct, OpenCL accesses the memory in blocks of 16 bytes, which is the same as 4 floats (each 4 bytes).

    That is not correct either. What OpenCL requires is that each struct member must be naturally aligned. For example, a float variable, since it has a size of 4 bytes, must be aligned to a 4-byte boundary. Your example struct could be defined just fine as

    typedef struct Params
    {
    float A;
    float B;
    int C;
    } Params;

    Keep in mind that, in accordance to C99 rules (of which OpenCL C is a derivative), compilers are free to insert padding between struct members and at the end of the struct. It wouldn’t be surprising at all if the particular compiler you are using has decided to pad the struct size to 16 bytes.

    If you want to pass structs to kernels it makes sense to specify alignment attributes, as explained in section 6.10.1 rather than attempting to guess what the particular OpenCL compiler you have installed is doing. Inserting hand-crafted padding members means that your program may not work correctly in a different computer.

    Additionally, at least on some implementations there seems to be an arbitrary limit of 9 constant (non-buffer) parameters, which using a struct will help you avoid.

    This limit is not arbitrary. It can be queried through clGetDeviceInfo() and CL_DEVICE_MAX_CONSTANT_ARGS.

    If you have any other doubts or questions about the OpenCL standard, please refer to the Khronos message boards where people will be happy to help.

  4. enj Post author

    Hello David,
    Thank’s for pointing these out, I will revise my post with some of your corrections, after making sure I have a grasp on my misunderstandings.
    The alignment issue is certainly a case of me not having the whole story, I have been telling the compiler to align the structs in my project’s code to 16 bytes (with __attribute__(alligned(16)) keyword, or #pragma pack(16) on windows). I read because of the fact that compilers can arbitrarily add padding it was a good idea to specify. I left that out of the tutorial, but perhaps I should put it back after double checking this. It helps me to try it out for myself.

    As far as passing in a struct not as an array, I have tried many different combinations to no avail. If someone can give me an example which works I will gladly update the tutorial but we’ve had to play this trick finding no alternative.

    Also thanks for pointing out the constant args device info, that’s good to know. 9 still seems rather small.

    Thanks again
    Ian

  5. enj Post author

    @David
    I just did a quick test, if I don’t have padding it works whether I specify the alignment or not, so I will remove that part from the post until I find a use for it.

  6. David Garcia

    As far as passing in a struct not as an array, I have tried many different combinations to no avail.

    I looked at the OpenCL conformance tests and couldn’t find any place that tests that feature. It’s possible that it’s broken in some implementations :(

  7. Emanuel Ey

    So i have been playing around with passing structures to OpenCL kernels as well, and I’ve actually been able to pass structs directly (i.e., without a pointer).
    I did this in C, but it should be easy to adapt for C++.

    Here are the most relevant points from the host code:

    typedef struct{
    cl_uchar cDist;
    cl_uchar cClass
    float y[N_POINTS_DEPTH];
    float x[N_POINTS_RANGE];
    float c1D[N_POINTS_DEPTH];
    }myStruct_t;

    The test kernel takes only 2 arguments, the struct as for input, and another struct of the same type to hold some test data computed from the input. Here’s the host-side memory allocation:

    //instantiate a struct for input:
    myStruct_t a;
    a.y[0] = 1.4;
    a.y[1] = 2.5;

    //allocate memory for output:
    cl_mem out = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(myStruct_t), NULL, &errNum);
    if(errNum != CL_SUCCESS){
    fatal(clError(errNum));
    }else{
    DEBUG(2, "Successfully allocated memory for output.\n");
    }

    Then, set the kernel args:

    errNum = clSetKernelArg(kernel, 0, sizeof(myStruct_t), &a);
    errNum |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &out);
    if (errNum != CL_SUCCESS){
    fprintf(stderr, "Error setting kernel arguments.\n");
    fatal(clError(errNum));
    exit(EXIT_FAILURE);
    }else{
    DEBUG(2, "Successfully defined kernel arguments.\n");
    }

    The kernel:

    __kernel void testStructs( private myStruct_t soundIn,
    global myStruct_t *result){
    local myStruct_t ssp;

    ssp.y[0] = 2*soundIn.y[0];
    ssp.y[1] = 2*soundIn.y[1];

    *result = ssp;
    }

    So from my tests i figured out that apparently you cannot actually use the ‘constant’ qualifier when passing in a struct, it has to be ‘private’. This is ok for me, since ‘constant’ is a special type of ‘global’ and reading it comes at a performance penalty anyway. To me it makes sense to have a settings struct in the fastest available memory.
    I tested this on an Nvidia GPU with OpenCL 1.1 and on a Intel Ivy Bridge CPU also with a OpenCL 1.1 implementation.

  8. Max

    I just wanted to comment on “Emanuel Ey” regarding his way of passing a struct by value into private memory.

    First of, the only way the struct will actually end up in private memory is by copying it from global or constant memory. Second, every thread has its own private memory so the struct will be duplicated for each thread wasting memory space. Third, a struct with array components that are accessed in a dynamic way (through a pointer for example) cannot be stored in registers anyways, so it will reside in global memory.
    Making things worse is that private memory that is spilled to global memory will not be cached because it makes no sense to do so, since each thread has its own copy.

    Regarding: “This is ok for me, since ‘constant’ is a special type of ‘global’ and reading it comes at a performance penalty anyway.”
    Yes it is global memory but there is a special constant cache (usually 8kb) on each multiprocessor which is as fast as L1 cache.
    So storing the struct in constant memory is in fact the most efficient way (those GPU designers have done their homework after all).

    Please don’t take my comment the wrong way, i’ve done some crazy “optimizations” myself in the past :)

  9. Balthazar

    I see there is a discussion about passing structs as arguments, and alignment. What works on one computer/compiler might not work on another. I see that in the cl.h file from Khronos there are alignment attributes added to the definitions of types like cl_float. It may be best to just use those types for structs that are passed between the host and the kernel.

Leave a Reply

Your email address will not be published. Required fields are marked *