This tutorial series is aimed at developers trying to learn OpenCL from the bottom up, with a focus on practicality. This part is a reworking of my first tutorial using the OpenCL C++ Bindings. Learning by example works best for me so make sure to get the code! It can only help you to have a copy of the OpenCL specification handy, and it doesn’t hurt (too bad) to read it!
My code works for me on my Macbook Pro (with Geforce 9400M) running Snow Leopard with the NVIDIA GPU SDK as well as on the Ubuntu 10.4 workstations (with GTX 480 or Geforce 8800GTX). Unfortunately I haven’t spent any time developing on Windows so for now my tutorials will be UNIX centric (I would LOVE any help in setting up a windows environment, I’ll need to eventually for my Blender project). Also the code should build against the ATI Stream SDK and run on the runtime, you can assume I’m complying with OpenCL 1.0 (1.1 may be covered in the more advanced topics, and I’ll point it out). Please let me know if you have build problems or device issues! Throughout the tutorial I will refer to device and GPU interchangeably, OpenCL can run on CPUs already and is targeting many other devices, but for now I’m assuming a GPGPU bias.
Let’s get started!
You’ll need to have installed:
NVIDIA or ATI GPU SDK and OpenCL enabled drivers
(Ubuntu ATI users might like extra guidance)
CMake (introduction and in-depth tutorial [pdf])
and it helps to have Git (introduction and nice book)
Download the code and for the rest of the tutorial I will refer to the directory it’s in as advcl
In the advcl directory you should have the following directories:
part1.5/ //the source code files for this tutorial cmake/ //CMake scripts that help locate necessary libraries opencl10/ //OpenCL 1.0 header files (downloaded from Khronos.org) opencl11/ //OpenCL 1.1 header files (downloaded from Khronos.org)
First we will build the code to make sure it works, I like to do an “out of source” build like so:
cd part1.5 mkdir build cd build cmake .. make
This will generate all of the build files, the Makefile, the library and executable in the build directory which avoids cluttering up your source code directory. You can run the example like so:
./part1.x
The Source Code Files
Let’s first have a broad overview of each of the source files and then we can dive in and look at what the code is doing. I’ve the code up to be a library with a CL class that can be instantiated and utilized anywhere. For this tutorial it is not very generalized so that it’s easier to see what’s going on behind the scenes. In future tutorials we will refactor and make our library more powerful!
main.cpp
This is where we test out our CL class. We instantiate it, give it an opencl program to compile and run, then execute the kernel.
cll.h
The main header file for our CL class definition, also handles including the OpenCL libraries. I’ve downloaded the header files from the Khronos website to avoid having to search the computer for a particular SDK.
cll.cpp
The core implementation of our CL class, including functions for initializing the OpenCL context, loading and building an OpenCL program.
part1.cpp
Implementation of the functions that setup and run the OpenCL kernel. This is where we actually see OpenCL in action.
part1.cl
The actual OpenCL code to be executed. Right now it’s a simple kernel that adds two arrays and stores the result in a third.
util.h and util.cpp
Utility functions that make things like reading files or printing out OpenCL error messages easier
CMakeLists.txt
The configuration and build script used to build the project. This makes it easier to be portable, and building our code as a library makes it easier to contribute to other projects.
The Source Code Contents
Let’s follow the execution of the main function to see what order things need to be done in. Of course we need to include our library’s definitions:
#include "cll.h"
This includes the OpenCL headers as well as defines our CL class. If you look there you will see a few public member objects of type cl::Buffer which will point to arrays on our device. The private cl::* member objects are key OpenCL objects which we will use in the constructor to setup OpenCL for execution.
The constructor is defined in cll.cpp where it does a few things: sets the platform, sets the device to use, creates the OpenCL context and a command queue. Let’s see how its done:
std::vector<cl::Platform> platforms; cl::Platform::get(&platforms); printf("cl::Platform::get(): %s\n", oclErrorString(err));
This is the first example of the nice cl wrapper class specified in the C++ bindings where we get the device Platforms. As you can see it is much cleaner than the previous code. The 3rd line is something you will see peppered throughout the code, oclErrorString(err) is a useful helper function I lifted from the NVIDIA SDK that gives you a little more info about what went wrong if you have a problem. You can get some more explanation of the different error types if you go to the khronos specification for the function that errored, for example clGetDeviceIDs (a quick google search for the function name generally returns the khronos page first).
After we set the platform (for now we just select the first platform in the list), we use it to create a context. For this tutorial we choose the GPU to be the device. This could of course become more sophisticated with multiple graphics cards or other devices, and in another tutorial we will discuss checking for device capabilities.
cl_context_properties properties[] = { CL_CONTEXT_PLATFORM, (cl_context_properties)(platforms[0])(), 0}; context = cl::Context(CL_DEVICE_TYPE_GPU, properties); devices = context.getInfo<CL_CONTEXT_DEVICES>();
and the last action of the constructor is to create the command queue
deviceUsed = 0; try{ queue = cl::CommandQueue(context, devices[deviceUsed], 0, &err); } catch (cl::Error er) { printf("ERROR: %s(%d)\n", er.what(), er.err()); }
Here you see some use of exceptions provided by cl.hpp, note it is necessary to set
#define __CL_ENABLE_EXCEPTIONS
as I did at the top of cll.h.
Once the constructor is finished, we want to load our OpenCL program, so in main.cpp we do:
#include "part1.cl" example.loadProgram("part1.cl");
You may wonder why we have an include here, I find it a nice way to bundle our OpenCL source with our library using a neat little macro trick.
loadProgram is defined in cll.cpp:
int pl; cl::Program::Sources source(1, std::make_pair(kernel_source,pl)); program = cl::Program(context, source);
Once the program is created, it must also be compiled. It also helps to have some compiler output incase we have syntax errors (never!).
err = program.build(devices); printf("program.build: %s\n", oclErrorString(err)); if(err != CL_SUCCESS){ std::cout << "Build Status: " << program.getBuildInfo<CL_PROGRAM_BUILD_STATUS>(devices[0]) << std::endl; std::cout << "Build Options:\t" << program.getBuildInfo<CL_PROGRAM_BUILD_OPTIONS>(devices[0]) << std::endl; std::cout << "Build Log:\t " << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0]) << std::endl; }
Once the program is loaded and built we are ready to pass the data to our device and do some computing! Let’s take a quick look at our simple kernel in part1.cl so we have an idea of what we are trying to do:
__kernel void part1(__global float* a, __global float* b, __global float* c) { unsigned int i = get_global_id(0); c[i] = a[i] + b[i]; }
Since I’m still forming my understanding of OpenCL I’ll avoid explaining the details (you and me both gotta read the spec! or check out some of NVIDIA’s nice webinars). This serves as a practical starting point for you to write your own kernels, so lets see what I’m doing here.
I define my kernel as part1 and it takes in three paramaters. The __global keyword says what kind of device memory our input is stored in, and besides that we should already be familiar with passing arrays to functions as a pointer.
One way to think of the kernel is as a replacement for a for loop (you can do much cooler things, but we gotta start somewhere). We get the index in the for loop from the get_global_id built in function, and then we do our operation on the arrays at that index. We let OpenCL split up the arrays into work-units and it will try to do as many of them as possible in parallel.
So how do we give OpenCL our arrays and tell it to do the work? That’s all in part1.cpp
We setup our kernel with the popCorn function (hehe)
First we go ahead and call it (with error checking)
try{ kernel = cl::Kernel(program, "part1", &err); } catch (cl::Error er) { printf("ERROR: %s(%d)\n", er.what(), er.err()); }
Notice that the string we pass in is the name of the kernel as we defined it in the .cl file.
Then we setup the arrays we want to work on, I created a trivial example as you can see in the file. The important part is creating the OpenCL array buffers and pushing the data to the device:
size_t array_size = sizeof(float) * num; //our input arrays cl_a = cl::Buffer(context, CL_MEM_READ_ONLY, array_size, NULL, &err); cl_b = cl::Buffer(context, CL_MEM_READ_ONLY, array_size, NULL, &err); //our output arrayw cl_c = cl::Buffer(context, CL_MEM_WRITE_ONLY, array_size, NULL, &err);
Note that our input are defined as read buffers, and our output is a write, there is also CL_MEM_READ_WRITE, and these are suggestions to the device on how you will use the buffers so it can optimize for performance.
Pushing the data is the same for each one:
err = queue.enqueueWriteBuffer(cl_a, CL_TRUE, 0, array_size, a, NULL, &event);
Then we tell the kernel which buffers correspond to which arguments
err = kernel.setArg(0, cl_a); err = kernel.setArg(1, cl_b); err = kernel.setArg(2, cl_c);
You can also pass in other types of arguments, but we should see that in the next installment.
We introduce
queue.finish();
Which makes sure that all commands in the queue are done executing before the program continues. This will be more important in later tutorials where we run our kernel in a loop as fast as possible.
For now we just run it once in the runKernel() function:
err = queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(num), cl::NullRange, NULL, &event);
The cl::NDRange(num) is the global workgroup size, and is one-dimensional like our arrays.
Finally, we read from our device memory to see if the c array got updated the way we expected!
float *c_done; err = queue.enqueueReadBuffer(cl_c, CL_TRUE, 0, sizeof(float) * num, c_done, NULL, &event);
now we can simply print out the contents of the c_done array!
for(int i=0; i < num; i++) { printf("c_done[%d] = %g\n", i, c_done[i]); }
Hopefully the code and this walk-through give you a good starting point to learn more. Expect more tutorials, with the next one involving the OpenCL Profiler and determining device information, followed by OpenCL/OpenGL context sharing.
I definitely appreciate any feedback!
Hey thanks for the tutorial . . .
It seems like you’re missing some memory management in part1.cpp. In runKernel(), you need to allocate memory for the result array:
float *c_done = new float[num];
Without that you will probably get an access violation (I do with Visual Studio on windows).
You should also probably delete that array and the others in popCorn():
delete [] a;
delete [] b;
delete [] c;
Pingback: Adventures in OpenCL Part 2: Particles with OpenGL | enj
Pingback: Adventures in OpenCL: Part 1, Getting Started | enj
arrrgh why cant i escape arrays :)
doing this now on open CL compiled in C# :)
Hi dear , Thanks for your code and guidance , I tried to compile and run main.cpp , part2.cpp but it’s giving lots of errors . Was the code complete or need corrections ?
It almost work out of the box on Ubuntu Maverick Meerkat with CUDA drivers enabled. The only modification I had to do was to replace
exit(0);
in your main.cpp by a
return 0;
:)
Pingback: Adventures in OpenCL Part 3: Constant Memory Structs | enj
Hey,
Really nice tutorial. Well done! :)
I am doing research for technology that I should use in voxel based terrain renderer and OpenCL seems like way to go.
The probable answer to this is a flat out – NO, but is it possible to create new memory elements on the GPU and pass them back to the host? I am thinking about a situration where new vertexes might need to be created (e.g. tesselation of 3D objects).
Thanks for your tutorial, I’m about to use OpenCL in C++ for my Bachelor’s thesis and I’m glad I found your site :)
I think I understood everything, but one question remains: Why do you split the implementation of the functions in cll.h across the two files cll.cpp and part1.cpp ?
Thanks for your tutorial. I was looking for a quick start in OpenCL and this really helped me forward.