This tutorial series is aimed at developers trying to learn OpenCL from the bottom up, with a focus on practicality (i.e. I’m still learning, I’m sharing what I’ve found to work). Learning by example works best for me so make sure to get the code! It can only help you to have a copy of the OpenCL specification handy, and it doesn’t hurt (too bad) to read it!
NOTE: If you are interested in using the C++ Bindings, I recommend you check out my revision of this tutorial.
My code works for me on my Macbook Pro (with Geforce 9400M) running Snow Leopard with the NVIDIA GPU SDK as well as on the Ubuntu 10.4 workstations (with GTX 480 or Geforce 8800GTX). Unfortunately I haven’t spent any time developing on Windows so for now my tutorials will be UNIX centric (I would LOVE any help in setting up a windows environment, I’ll need to eventually for my Blender project). Also the code should build against the ATI Stream SDK and run on the runtime, you can assume I’m complying with OpenCL 1.0 (1.1 may be covered in the more advanced topics, and I’ll point it out). Please let me know if you have build problems or device issues! Throughout the tutorial I will refer to device and GPU interchangeably, OpenCL can run on CPUs already and is targeting many other devices, but for now I’m assuming a GPGPU bias.
Let’s get started!
You’ll need to have installed:
NVIDIA or ATI GPU SDK and OpenCL enabled drivers
(Ubuntu ATI users might like extra guidance)
CMake (introduction and in-depth tutorial [pdf])
and it helps to have Git (introduction and nice book)
Download the code and for the rest of the tutorial I will refer to the directory it’s in as advcl
In the advcl directory you should have the following directories:
part1/ //the source code files for this tutorial part1.5/ //the source code for the part1.5 (C++ bindings) cmake/ //CMake scripts that help locate necessary libraries opencl10/ //OpenCL 1.0 header files (downloaded from Khronos.org) opencl11/ //OpenCL 1.1 header files (downloaded from Khronos.org)
First we will build the code to make sure it works, I like to do an “out of source” build like so:
cd part1 mkdir build cd build cmake .. make
This will generate all of the build files, the Makefile, the library and executable in the build directory which avoids cluttering up your source code directory. You can run the example like so:
./part1.x
The Source Code Files
Let’s first have a broad overview of each of the source files and then we can dive in and look at what the code is doing. I’ve the code up to be a library with a CL class that can be instantiated and utilized anywhere. For this tutorial it is not very generalized so that it’s easier to see what’s going on behind the scenes. In future tutorials we will refactor and make our library more powerful!
main.cpp
This is where we test out our CL class. We instantiate it, give it an opencl program to compile and run, then execute the kernel.
cll.h
The main header file for our CL class definition, also handles including the OpenCL libraries on both Linux and Mac.
cll.cpp
The core implementation of our CL class, including functions for initializing the OpenCL context, loading and building an OpenCL program and cleaning up the GPU memory we used.
part1.cpp
Implementation of the functions that setup and run the OpenCL kernel. This is where we actually see OpenCL in action.
part1.cl
The actual OpenCL code to be executed. Right now it’s a simple kernel that adds two arrays and stores the result in a third.
util.h and util.cpp
Utility functions that make things like reading files or printing out OpenCL error messages easier
CMakeLists.txt
The configuration and build script used to build the project. This makes it easier to be portable, and building our code as a library makes it easier to contribute to other projects.
The Source Code Contents
Let’s follow the execution of the main function to see what order things need to be done in. Of course we need to include our library’s definitions:
#include "cll.h"
This includes the OpenCL headers as well as defines our CL class. If you look there you will see a few public members of type cl_mem which will point to arrays on our device. The private cl_* members are handles to key OpenCL objects which we will use in the constructor to setup OpenCL for execution.
The constructor is defined in cll.cpp where it does a few things: sets the platform, sets the device to use, creates the OpenCL context and a command queue. Let’s see how its done:
err = oclGetPlatformID(&platform);
printf("oclGetPlatformID: %s\n", oclErrorString(err));
The oclGetPlatformID function is a helper function defined in util.cpp I lifted from the NVIDIA SDK (I figure they won’t mind for educational purposes but it’s not under a free license so be careful!). This function looks for the NVIDIA platform and defaults to the first available one if its not found (“Apple” is the only platform on my Mac). The 2nd line is something you will see peppered throughout the code, oclErrorString(err) is another useful NVIDIA helper function that gives you a little more info about what went wrong if you have a problem. You can get some more explanation of the different error types if you go to the khronos specification for the function that errored, for example clGetDeviceIDs (a quick google search for the function name generally returns the khronos page first).
After we set the platform, we use it to select a device. In this code we get a list of the available devices but we end up just choosing the first in the list. This could of course become more sophisticated with multiple graphics cards, and in another tutorial we will discuss checking for device capabilities.
err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &numDevices); devices = new cl_device_id [numDevices]; err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, numDevices, devices, NULL); deviceUsed = 0;
Note that we hardcoded the device type to be GPU, you can check the specification for the others but I’ll give you a hint that one of them starts with CL and ends with CPU ;)
Now we can make an OpenCL context using our device (the more advanced options like sharing an OpenGL context will come in a following tutorial):
context = clCreateContext(0, 1, &devices[deviceUsed], NULL, NULL, &err);
and the last action of the constructor is to create the command queue
command_queue = clCreateCommandQueue(context, devices[deviceUsed], 0, &err);
Once the constructor is finished, we want to load our OpenCL program, so in main.cpp we call
example.loadProgram("part1.cl");
loadProgram is defined in cll.cpp, it simply reads in the .cl file as a string and passes the string to
program = clCreateProgramWithSource(context, 1, (const char **) &cSourceCL, &program_length, &err);
Here we load only one string of program_length characters, but you could do more at once.
Once the program is created, it must also be compiled so we call the private function buildExecutable to do that. It has some nice error checking that will output the build log if you have syntax errors in your OpenCL code. I won’t detail it here but check out the bottom of the cll.cpp file.
Once the program is loaded and built we are ready to pass the data to our device and do some computing! Let’s take a quick look at our simple kernel in part1.cl so we have an idea of what we are trying to do:
__kernel void part1(__global float* a, __global float* b, __global float* c)
{
unsigned int i = get_global_id(0);
c[i] = a[i] + b[i];
}
Since I’m still forming my understanding of OpenCL I’ll avoid explaining the details (you and me both gotta read the spec! or check out some of NVIDIA’s nice webinars). This serves as a practical starting point for you to write your own kernels, so lets see what I’m doing here.
I define my kernel as part1 and it takes in three paramaters. The __global keyword says what kind of device memory our input is stored in, and besides that we should already be familiar with passing arrays to functions as a pointer.
One way to think of the kernel is as a replacement for a for loop (you can do much cooler things, but we gotta start somewhere). We get the index in the for loop from the get_global_id built in function, and then we do our operation on the arrays at that index. We let OpenCL split up the arrays into work-units and it will try to do as many of them as possible in parallel.
So how do we give OpenCL our arrays and tell it to do the work? That’s all in part1.cpp
We setup our kernel with the popCorn function (hehe)
First we go ahead and call
kernel = clCreateKernel(program, "part1", &err);
Then we setup the arrays we want to work on, I created a trivial example as you can see in the file. The important part is creating the OpenCL array buffers and pushing the data to the device:
//our input arrays cl_a = clCreateBuffer(context, CL_MEM_READ_ONLY|CL_MEM_COPY_HOST_PTR, sizeof(float) * num, a, &err); cl_b = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(float) * num, NULL, &err); //our output array cl_c = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(float) * num, NULL, &err);
Note that our input are defined as read buffers, and our output is a write, there is also CL_MEM_READ_WRITE, and these are suggestions to the device on how you will use the buffers so it can optimize for performance. For the a array the clCreateBuffer function copies the data from the CPU to the GPU for us. If you want to separate creating the buffer from pushing the data you can do the data push as a separate call:
err = clEnqueueWriteBuffer(command_queue, cl_b, CL_TRUE, 0, sizeof(float) * num, b, 0, NULL, &event);
Then we tell the kernel which buffers correspond to which arguments
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &cl_a); err = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *) &cl_b); err = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *) &cl_c);
You can also pass in other types of arguments, but we should see that in the next example.
We introduce
clFinish(command_queue);
Which makes sure that all commands in the queue are done executing before the program continues. This will be more important in later tutorials where we run our kernel in a loop as fast as possible.
For now we just run it once in the runKernel() function:
err = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, workGroupSize, NULL, 0, NULL, &event);
The workGroupSize is defined to be the same size as our data arrays, and so it is 1 dimensional. This will be much more interesting with more complex kernels, but for now we just make them the same.
Finally, we read from our device memory to see if the c array got updated the way we expected!
float *c_done; err = clEnqueueReadBuffer(command_queue, cl_c, CL_TRUE, 0, sizeof(float) * num, c_done, 0, NULL, &event);
now we can simply print out the contents of the c_done array!
for(int i=0; i < num; i++)
{
printf("c_done[%d] = %g\n", i, c_done[i]);
}
Hopefully the code and this walk-through give you a good starting point to learn more. Expect more tutorials, with the next one involving the OpenCL Profiler and determining device information, followed by OpenCL/OpenGL context sharing.
I definitely appreciate any feedback!
very good tutorial , waiting for your next tutos hope it’s going to be soon.
I can highly recommend using the C++ bindings instead of the plain C ones for OpenCL. They are much friendlier IMHO. You can find them here:
http://www.khronos.org/registry/cl/
(it’s a single header file only)
Pingback: Adventures in OpenCL: Part 1.5, C++ Bindingsenj | enj
Thanks for the suggestion Paul!
I’ve tried it out and re-written this tutorial using the C++ bindings, and you’re right I find them much cleaner as well.
http://enja.org/2010/07/20/adventures-in-opencl-part-1-5-cpp-bindings/
Hi,
I am a beginner using cuda and opencl and I’ve tried to setup correctly cmake on my computer, but I am getting an error
“CMake Error: The following variables are used in this project, but they are set
to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake file
s:
OPENCL_LIBRARIES
linked by target “part1.x” in directory C:/Dev/advcl/part1″
I have a quadro FX 3800M with Notebook Developer Drivers 257.21.
I rebooted after the installation.
Do I need to set something special? Is OPENCL_LIBRARIES not set correctly by cmake?
Thanks
Hey Cedric,
Unfortunately I don’t have windows handy to test right now. You can look in the cmake/FindOPENC.cmake file to see how it is searching for OPENCL_LIBRARIES
You could try finding out where OpenCL.lib or OpenCL.dll is on your system and adding a LINK_LIBARIES(C:/path/to/OpenCL.lib) into the CMakeLists.txt in part1/
wish I could be more help
Ian
Pingback: Adventures in OpenCL Part 2: Particles with OpenGL | enj