Adventures in OpenCL: Part 1, Getting Started

This tutorial series is aimed at developers trying to learn OpenCL from the bottom up, with a focus on practicality (i.e. I’m still learning, so I’m sharing what I’ve found to work). Learning by example works best for me, so make sure to get the code! It can only help to have a copy of the OpenCL specification handy, and it doesn’t hurt (too badly) to read it!

NOTE: If you are interested in using the C++ Bindings, I recommend you check out my revision of this tutorial.

My code works for me on my MacBook Pro (with a GeForce 9400M) running Snow Leopard with the NVIDIA GPU SDK, as well as on Ubuntu 10.04 workstations (with a GTX 480 or GeForce 8800 GTX). Unfortunately I haven’t spent any time developing on Windows, so for now my tutorials will be UNIX-centric (I would LOVE any help setting up a Windows environment; I’ll need one eventually for my Blender project). The code should also build against the ATI Stream SDK and run on its runtime. You can assume I’m complying with OpenCL 1.0 (1.1 may be covered in the more advanced topics, and I’ll point it out when it is). Please let me know if you have build problems or device issues! Throughout the tutorial I will use device and GPU interchangeably: OpenCL can already run on CPUs and is targeting many other devices, but for now I’m assuming a GPGPU bias.

Let’s get started!

You’ll need to have installed:

NVIDIA or ATI GPU SDK and OpenCL enabled drivers
(Ubuntu ATI users might like extra guidance)
CMake (introduction and in-depth tutorial [pdf])
and it helps to have Git (introduction and nice book)

Download the code; for the rest of the tutorial I will refer to the directory it’s in as advcl.

In the advcl directory you should have the following directories:

part1/       //the source code files for this tutorial
part1.5/     //the source code for the part1.5 (C++ bindings)
cmake/       //CMake scripts that help locate necessary libraries
opencl10/    //OpenCL 1.0 header files (downloaded from
opencl11/    //OpenCL 1.1 header files (downloaded from

First we will build the code to make sure it works. I like to do an “out of source” build like so:

cd part1
mkdir build
cd build
cmake ..
make

Running cmake generates the build files (including the Makefile) in the build directory, and make then builds the library and executable there, which avoids cluttering up your source code directory. You can run the example like so:

./part1.x

The Source Code Files

Let’s first get a broad overview of each of the source files, and then we can dive in and look at what the code is doing. I’ve set the code up as a library with a CL class that can be instantiated and used anywhere. For this tutorial it is not very generalized, so that it’s easier to see what’s going on behind the scenes. In future tutorials we will refactor and make our library more powerful!

main.cpp
This is where we test out our CL class. We instantiate it, give it an OpenCL program to compile, then execute the kernel.

cll.h
The main header file for our CL class definition; it also handles including the OpenCL headers on both Linux and Mac.

cll.cpp
The core implementation of our CL class, including functions for initializing the OpenCL context, loading and building an OpenCL program, and cleaning up the GPU memory we used.

part1.cpp
Implementation of the functions that set up and run the OpenCL kernel. This is where we actually see OpenCL in action.

The .cl kernel file
The actual OpenCL code to be executed. Right now it’s a simple kernel that adds two arrays and stores the result in a third.

util.h and util.cpp
Utility functions that make things like reading files or printing out OpenCL error messages easier.

CMakeLists.txt
The configuration and build script used to build the project. This makes the code easier to port, and building our code as a library makes it easier to contribute to other projects.

The Source Code Contents

Let’s follow the execution of the main function to see what order things need to be done in. Of course we need to include our library’s definitions:

#include "cll.h"

This includes the OpenCL headers and defines our CL class. If you look there you will see a few public members of type cl_mem, which will point to arrays on our device. The private cl_* members are handles to key OpenCL objects, which we use in the constructor to set up OpenCL for execution.
The constructor is defined in cll.cpp, where it does a few things: it sets the platform, selects the device to use, and creates the OpenCL context and a command queue. Let’s see how it’s done:

err = oclGetPlatformID(&platform);
printf("oclGetPlatformID: %s\n", oclErrorString(err));

The oclGetPlatformID function is a helper function defined in util.cpp that I lifted from the NVIDIA SDK (I figure they won’t mind for educational purposes, but it’s not under a free license, so be careful!). It looks for the NVIDIA platform and defaults to the first available platform if NVIDIA’s isn’t found (“Apple” is the only platform on my Mac). The second line is something you will see peppered throughout the code: oclErrorString(err) is another useful NVIDIA helper function that gives you a little more information about what went wrong if you have a problem. You can get more explanation of the different error types from the Khronos specification page for the function that returned the error, for example clGetDeviceIDs (a quick Google search for the function name generally returns the Khronos page first).
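To make the idea behind oclErrorString concrete, here is a minimal sketch of an error-code lookup table. It is not NVIDIA’s implementation: the name ocl_error_name is hypothetical, only a handful of codes are listed, and the numeric values are copied from the OpenCL 1.0 CL/cl.h header so the sketch compiles without an SDK.

```c
#include <stddef.h>

/* One entry per error code. The numeric values are the ones defined in
 * the OpenCL 1.0 header CL/cl.h; only a few common codes are listed. */
struct ocl_error_entry {
    int code;
    const char *name;
};

static const struct ocl_error_entry ocl_error_table[] = {
    {   0, "CL_SUCCESS" },
    {  -1, "CL_DEVICE_NOT_FOUND" },
    {  -2, "CL_DEVICE_NOT_AVAILABLE" },
    {  -3, "CL_COMPILER_NOT_AVAILABLE" },
    {  -4, "CL_MEM_OBJECT_ALLOCATION_FAILURE" },
    {  -5, "CL_OUT_OF_RESOURCES" },
    {  -6, "CL_OUT_OF_HOST_MEMORY" },
    { -11, "CL_BUILD_PROGRAM_FAILURE" },
    { -30, "CL_INVALID_VALUE" },
    { -32, "CL_INVALID_PLATFORM" },
    { -33, "CL_INVALID_DEVICE" },
    { -34, "CL_INVALID_CONTEXT" },
    { -36, "CL_INVALID_COMMAND_QUEUE" },
    { -38, "CL_INVALID_MEM_OBJECT" },
    { -44, "CL_INVALID_PROGRAM" },
    { -45, "CL_INVALID_PROGRAM_EXECUTABLE" },
    { -46, "CL_INVALID_KERNEL_NAME" },
    { -48, "CL_INVALID_KERNEL" },
    { -49, "CL_INVALID_ARG_INDEX" },
    { -52, "CL_INVALID_KERNEL_ARGS" },
    { -54, "CL_INVALID_WORK_GROUP_SIZE" },
};

/* Hypothetical helper similar in spirit to oclErrorString:
 * returns the symbolic name for an error code, or a fallback string. */
const char *ocl_error_name(int err)
{
    size_t count = sizeof ocl_error_table / sizeof ocl_error_table[0];
    for (size_t i = 0; i < count; i++) {
        if (ocl_error_table[i].code == err)
            return ocl_error_table[i].name;
    }
    return "Unknown OpenCL error";
}
```

Used like the tutorial’s helper, printf("clGetDeviceIDs: %s\n", ocl_error_name(err)); prints CL_DEVICE_NOT_FOUND when err is -1. Codes missing from the table, such as vendor extension codes, fall through to the generic message.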

After we set the platform, we use it to select a device. In this code we get a list of the available devices but we end up just choosing the first in the list. This could of course become more sophisticated with multiple graphics cards, and in another tutorial we will discuss checking for device capabilities.

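// first call: ask only how many GPU devices this platform has
err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &numDevices);
// allocate room for them, then call again to fill in the list
devices = new cl_device_id [numDevices];
err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, numDevices, devices, NULL);
// just use the first device in the list
deviceUsed = 0;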

Note that we hardcoded the device type to be GPU. You can check the specification for the other types, but I’ll give you a hint that one of them starts with CL and ends with CPU ;)
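If no GPU is present, the first clGetDeviceIDs call returns CL_DEVICE_NOT_FOUND and numDevices may never be set, so the allocation and the second call go wrong. The sketch below shows the same count-then-fill pattern with a fallback from GPU to CPU. To keep it compilable without an OpenCL SDK it assumes a hypothetical mock_get_device_ids that mimics clGetDeviceIDs’s calling convention; DEV_GPU, DEV_CPU, device_id and the DEV_ERR_* constants are stand-ins, not real OpenCL names.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for CL_DEVICE_TYPE_GPU / CL_DEVICE_TYPE_CPU and
 * CL_SUCCESS / CL_DEVICE_NOT_FOUND, so this compiles without cl.h. */
enum { DEV_GPU = 1, DEV_CPU = 2 };
enum { DEV_OK = 0, DEV_ERR_NOT_FOUND = -1 };
typedef int device_id;            /* stand-in for cl_device_id */

/* How many devices the mock "platform" reports for each type. */
static int mock_num_gpus = 0;
static int mock_num_cpus = 1;

/* Mock with clGetDeviceIDs's calling convention: pass num_entries == 0
 * and devices == NULL to ask only for the count. */
int mock_get_device_ids(int type, unsigned num_entries,
                        device_id *devices, unsigned *num_devices)
{
    unsigned available = (type == DEV_GPU) ? mock_num_gpus : mock_num_cpus;
    if (available == 0)
        return DEV_ERR_NOT_FOUND;
    if (num_devices != NULL)
        *num_devices = available;
    for (unsigned i = 0; i < num_entries && i < available; i++)
        devices[i] = type * 100 + (int)i;   /* fake, recognizable ids */
    return DEV_OK;
}

/* Try the GPU first, then fall back to the CPU. Returns the number of
 * devices found (0 if none) and stores a malloc'd list in *out, which the
 * caller frees. */
unsigned pick_devices(device_id **out, int *chosen_type)
{
    const int preferred[] = { DEV_GPU, DEV_CPU };
    for (int t = 0; t < 2; t++) {
        unsigned n = 0;
        /* first call: only ask how many devices of this type exist */
        if (mock_get_device_ids(preferred[t], 0, NULL, &n) != DEV_OK || n == 0)
            continue;   /* e.g. no GPU: don't allocate, try the next type */
        device_id *list = malloc(n * sizeof *list);
        if (list == NULL)
            break;
        /* second call: fill the list we just allocated */
        if (mock_get_device_ids(preferred[t], n, list, NULL) != DEV_OK) {
            free(list);
            continue;
        }
        *out = list;
        *chosen_type = preferred[t];
        return n;
    }
    *out = NULL;
    return 0;
}
```

In the real constructor the same shape would swap in clGetDeviceIDs with CL_DEVICE_TYPE_GPU and then CL_DEVICE_TYPE_CPU, checking err after the counting call before allocating anything.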
Now we can make an OpenCL context using our device (the more advanced options like sharing an OpenGL context will come in a following tutorial):

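cl_context_properties props[] = { CL_CONTEXT_PLATFORM, (cl_context_properties) platform, 0 }; // name the platform explicitly for portability across vendors
context = clCreateContext(props, 1, &devices[deviceUsed], NULL, NULL, &err);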

and the last action of the constructor is to create the command queue

command_queue = clCreateCommandQueue(context, devices[deviceUsed], 0, &err);

Once the constructor is finished, we want to load our OpenCL program, so main.cpp calls the CL class’s loadProgram function.

loadProgram is defined in cll.cpp; it simply reads in the .cl file as a string and passes the string to

program = clCreateProgramWithSource(context, 1, (const char **) &cSourceCL, &program_length, &err);

Here we load only one string of program_length characters, but clCreateProgramWithSource takes an array of strings (and an array of their lengths), so you could load several at once.
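Reading the .cl file is plain C file I/O. A minimal sketch of a loader in the spirit of loadProgram follows; the name read_source is hypothetical and the real code in cll.cpp may differ. It returns a NUL-terminated buffer together with its length, which is what clCreateProgramWithSource needs for a single string.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical loader in the spirit of loadProgram: reads a whole text
 * file into a NUL-terminated buffer and reports its length. Returns NULL
 * on any failure; on success the caller must free() the buffer. */
char *read_source(const char *path, size_t *length)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;

    /* find the file size by seeking to the end */
    if (fseek(f, 0, SEEK_END) != 0) {
        fclose(f);
        return NULL;
    }
    long size = ftell(f);
    if (size < 0) {
        fclose(f);
        return NULL;
    }
    rewind(f);

    char *buf = malloc((size_t)size + 1);
    if (buf == NULL) {
        fclose(f);
        return NULL;
    }
    size_t got = fread(buf, 1, (size_t)size, f);
    fclose(f);
    if (got != (size_t)size) {
        free(buf);
        return NULL;
    }
    buf[size] = '\0';   /* terminate so it can also be used as a C string */
    if (length != NULL)
        *length = (size_t)size;
    return buf;
}
```

With it, a single-string program would be created with something like clCreateProgramWithSource(context, 1, (const char **) &src, &len, &err), where src and len are the buffer and length returned by read_source.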
Once the program is created, it must also be compiled, so we call the private function buildExecutable to do that. It has some nice error checking that will output the build log if you have syntax errors in your OpenCL code. I won’t detail it here, but check out the bottom of the cll.cpp file.

Once the program is loaded and built, we are ready to pass the data to our device and do some computing! Let’s take a quick look at our simple kernel so we have an idea of what we are trying to do:

__kernel void part1(__global float* a, __global float* b, __global float* c)
{
    unsigned int i = get_global_id(0);
    c[i] = a[i] + b[i];
}

Since I’m still forming my understanding of OpenCL, I’ll avoid explaining the details (you and me both gotta read the spec! Or check out some of NVIDIA’s nice webinars). This serves as a practical starting point for you to write your own kernels, so let’s see what I’m doing here.
I define my kernel as part1, and it takes three parameters. The __global keyword says which kind of device memory (address space) our data is stored in; besides that, we should already be familiar with passing arrays to functions as pointers.

One way to think of the kernel is as a replacement for a for loop (you can do much cooler things, but we gotta start somewhere). We get the loop index from the built-in get_global_id function, and then we do our operation on the arrays at that index. We let OpenCL split the arrays up into work-items, and it will try to run as many of them as possible in parallel.
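The for-loop analogy can be written down directly. This serial C sketch (part1_item and run_part1_serially are hypothetical names) runs the kernel body once per global id; on the device, each of those calls is a separate work-item.

```c
#include <stddef.h>

/* Hypothetical: the body of the part1 kernel for a single work-item.
 * gid plays the role of get_global_id(0). */
static void part1_item(size_t gid, const float *a, const float *b, float *c)
{
    c[gid] = a[gid] + b[gid];
}

/* Run every work-item one after another. On the device these are
 * independent work-items that may execute in parallel and in any order,
 * which is only safe because no item depends on another's result. */
void run_part1_serially(const float *a, const float *b, float *c,
                        size_t global_size)
{
    for (size_t gid = 0; gid < global_size; gid++)
        part1_item(gid, a, b, c);
}
```

The loop bound is what the host later supplies as the global work size, which is why the kernel itself never needs to know the array length.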

So how do we give OpenCL our arrays and tell it to do the work? That’s all in part1.cpp.
We set up our kernel with the popCorn function (hehe).
First we go ahead and call

kernel = clCreateKernel(program, "part1", &err);

Then we set up the arrays we want to work on; I created a trivial example, as you can see in the file. The important part is creating the OpenCL array buffers and pushing the data to the device:

//our input arrays
cl_a = clCreateBuffer(context, CL_MEM_READ_ONLY|CL_MEM_COPY_HOST_PTR, sizeof(float) * num, a, &err);
cl_b = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(float) * num, NULL, &err);
//our output array
cl_c = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(float) * num, NULL, &err);

Note that our inputs are defined as read-only buffers and our output as write-only (there is also CL_MEM_READ_WRITE). These flags describe how the kernel will access each buffer, which lets the implementation optimize for performance. For the a array, the CL_MEM_COPY_HOST_PTR flag makes clCreateBuffer copy the data from the CPU to the GPU for us. If you want to separate creating the buffer from pushing the data, you can do the push as a separate call, as we do for b:

err = clEnqueueWriteBuffer(command_queue, cl_b, CL_TRUE, 0, sizeof(float) * num, b, 0, NULL, &event);

Then we tell the kernel which buffers correspond to which arguments:

err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &cl_a);
err  = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *) &cl_b);
err  = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *) &cl_c);

You can also pass in other types of arguments, but we will see that in the next example.
We also introduce

clFinish(command_queue);

which makes sure that all commands in the queue are done executing before the program continues. This will be more important in later tutorials, where we run our kernel in a loop as fast as possible.
For now we just run it once in the runKernel() function:

err = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, workGroupSize, NULL, 0, NULL, &event);

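The workGroupSize argument here is actually the global work size: it points to an array with one entry per dimension giving the total number of work-items. We set it to the size of our data arrays, so the range is 1-dimensional and each work-item handles one element. The NULL after it is the local work size; leaving it NULL lets the OpenCL implementation choose how to group the work-items. This will be much more interesting with more complex kernels, but for now we keep it simple.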
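To see how the global size relates to work-groups, here is a serial sketch of a 1-D NDRange with an explicit local size (run_ndrange_1d is a hypothetical name; passing NULL, as runKernel does, lets the implementation pick the local size instead). In OpenCL 1.0 the global size must be evenly divisible by the local size, and each work-item’s global id is its group id times the local size plus its local id.

```c
#include <stddef.h>

/* Serial sketch of a 1-D NDRange for the part1 kernel. Returns -1 when
 * the local size does not evenly divide the global size, mirroring the
 * OpenCL 1.0 CL_INVALID_WORK_GROUP_SIZE rule, and 0 on success. */
int run_ndrange_1d(size_t global_size, size_t local_size,
                   const float *a, const float *b, float *c)
{
    if (local_size == 0 || global_size % local_size != 0)
        return -1;
    size_t num_groups = global_size / local_size;
    for (size_t g = 0; g < num_groups; g++) {            /* get_group_id(0) */
        for (size_t l = 0; l < local_size; l++) {        /* get_local_id(0) */
            size_t gid = g * local_size + l;             /* get_global_id(0) */
            c[gid] = a[gid] + b[gid];
        }
    }
    return 0;
}
```

On a real device the groups are scheduled onto compute units and the items within a group run together, but the id arithmetic is the same.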
Finally, we read from our device memory to see if the c array got updated the way we expected!

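float *c_done = new float[num];  // host memory for the results; reading into an uninitialized pointer is undefined behavior
err = clEnqueueReadBuffer(command_queue, cl_c, CL_TRUE, 0, sizeof(float) * num, c_done, 0, NULL, &event);

Now we can simply print out the contents of the c_done array (CL_TRUE made the read blocking, so the data is already there)!

for(int i=0; i < num; i++)
    printf("c_done[%d] = %g\n", i, c_done[i]);
delete [] c_done;  // free the host copy when we are done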
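Printing works for a handful of elements, but for bigger arrays it helps to check the results automatically. A minimal sketch, assuming the host arrays a and b are still around and using a hypothetical check_sum function with a caller-chosen tolerance: it recomputes a + b on the CPU and counts mismatches. Call it while c_done is still allocated.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical check: compare device results against a CPU reference for
 * c = a + b. Returns the number of mismatches, printing the first few.
 * A tolerance of 0 is a reasonable start for a single float addition;
 * other kernels may need some slack. */
size_t check_sum(const float *a, const float *b, const float *c_done,
                 size_t num, float tolerance)
{
    size_t mismatches = 0;
    for (size_t i = 0; i < num; i++) {
        float expected = a[i] + b[i];   /* CPU reference value */
        float diff = c_done[i] - expected;
        if (diff < 0.0f)
            diff = -diff;
        /* diff != diff catches NaN, which compares unequal to itself */
        if (diff > tolerance || diff != diff) {
            if (mismatches < 5)         /* only report the first few */
                printf("mismatch at %zu: got %g, expected %g\n",
                       i, c_done[i], expected);
            mismatches++;
        }
    }
    return mismatches;
}
```

A result of 0 means the device produced the same sums as the CPU; anything else points at a setup problem such as a failed buffer write or an unset kernel argument.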

Hopefully the code and this walk-through give you a good starting point to learn more. Expect more tutorials, with the next one involving the OpenCL Profiler and determining device information, followed by OpenCL/OpenGL context sharing.
I definitely appreciate any feedback!

36 thoughts on “Adventures in OpenCL: Part 1, Getting Started”

  1. Pingback: Adventures in OpenCL: Part 1.5, C++ Bindings | enj

  2. Cedric


    I am a beginner using cuda and opencl and I’ve tried to setup correctly cmake on my computer, but I am getting an error
    “CMake Error: The following variables are used in this project, but they are set
    to NOTFOUND.
    Please set them or make sure they are set and tested correctly in the CMake file
    linked by target “part1.x” in directory C:/Dev/advcl/part1”
    I have a quadro FX 3800M with Notebook Developer Drivers 257.21.
    I rebooted after the installation.

    Do I need to set something special? Is OPENCL_LIBRARIES not set correctly by cmake?


  3. enj Post author

    Hey Cedric,

    Unfortunately I don’t have Windows handy to test right now. You can look in the cmake/FindOPENCL.cmake file to see how it searches for OPENCL_LIBRARIES.

    You could try finding out where OpenCL.lib or OpenCL.dll is on your system and adding a LINK_LIBRARIES(C:/path/to/OpenCL.lib) line to the CMakeLists.txt in part1/.

    wish I could be more help

  4. Pingback: Adventures in OpenCL Part 2: Particles with OpenGL | enj

  5. Andrew Myers

    Just FYI the code in part1 will not compile on my system without including stdlib.h in cll.cpp because of the reference to free.

    korthan% nvcc --version
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2010 NVIDIA Corporation
    Built on Mon_Jun__7_18:10:28_PDT_2010
    Cuda compilation tools, release 3.1, V0.2.1221
    korthan% gcc --version
    gcc (GCC) 4.4.4
    Copyright (C) 2010 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions. There is NO

  6. Rich

    @Cedric: I ran into a similar problem. On my linux box, I found I had to change the CMakeLists.txt file line 20 from

    The macro is being expanded without proper quotations.

  7. dragon

    Hi, just trying your tutorial (and opencl for the first time). I am using a similar environment (snow leopard). First thing is that “cmake ..” barfs with environment variables “not set” : CMAKE_C_COMPILER, CMAKE_C_COMPILER_ENV_VAR, etc.
    probably should cover that in the README at least.

  8. dragon

    cmake fails for me with messages like: Missing variable is:

    -seems there is no “make” on my system, even tho I have xcode installed. System sounds similar to yours: snow leopard 10.6.4. I find it infernally difficult to get clear advice thru searches on where to get make for OSX.

  9. dragon

    Hi enj,
    make was in the developer directory as gnumake and there was a symlink to it, in Developer/usr/bin but that is not in my path. So I created a symlink to it in usr/bin.
    still fails with the environment variable error messages. Are you saying they are setup by macports?

  10. dragon

    this was the macports error:
    “Xcode is not installed, or was installed with UNIX Development (10.5+) or Command Line Support (10.4) deselected.”
    well xcode was installed and it was the latest … *but*:
    deleting xcode and reinstalling from a new download (version number of the new install from apple developer connection was:3.2.2, previous installed xcode was 3.2.3!) and macports installs…, go figure
    so “cmake ..” now works!

  11. enj Post author

    hey dragon,
    So does everything compile for you now? Getting a development environment set up nice is always a pain!

  12. Christian


    I would like to recommend you to rewrite the C-Tutorial. From my point of view, the C-bindings are quite important … lot of NVIDIA-Devs use them. AND since OpenCL IDC, the context initialization requires a platform as first parameter, somehow as seen in C++ -bindings. If anybody has them working for NV and ATI, please post it here … or a link to the source. I am working since the first OpenCL release, but until now, I was only able to write seperated programs, either NV or ATI.

    Little hint to the line to fix (as I’ve said, I’ve done it, but only as seperate lines, either working on AMD/ATI OR NV):

    context = clCreateContext(0, 1, &devices[deviceUsed], NULL, NULL, &err);

  13. Raj

    Hi Everyone,
    I want to execute OpenCL program in xcode.
    Please, someone help me…
    Its very urgent for me, if anybody know how to run it, please send me steps to follow by email.

    I will be very very thankful for you!!!

    Thank You!
    IIIT-Allahabad, INDIA.

  14. Sergey L.


    Thanks for the tutorial. It was useful!

    The following lines should be inserted into the CL class constructor:

    cl_a = 0;
    cl_b = 0;
    cl_c = 0;

    Otherwise segmentation fault may occur if popCorn() method was not called during the program execution.

    Best regards,
    Sergey L.

  15. JB

    I was having the same problem as Cedric (above) except I’m using Ubuntu 10.04. Looking in cmake/FindOPENCL.cmake I found that the LD_LIBRARY_PATH environment variable was referenced when cmake used “FIND_LIBRARY” on the OPENCL_LIBRARIES variable (I’m not a cmake expert). I used:
    $echo $LD_LIBRARY_PATH
    and found that it wasn’t set… (strange but beside the point). So I set it to the path of my OpenCL .so files (similar to Windows .dll files); for me this command was:
    $LD_LIBRARY_PATH=$/home/jb/ati-stream-sdk-v2.2-lnx64/lib/x86_64
    Then I followed the instructions in the tute and it worked. Hope this helps someone!

  16. Konrad


    thank you for this great tutorial!

    I had the same problem and solution to the “free()” problem. #include <stdlib.h> is definitely needed.

    Anyway, I’m new to OpenCL but have done some CUDA coding on my older laptop. My old laptop had a GPU that was to old to run the CUDA code on the GPU but at least I was able to debug my CUDA code in emulation mode. I was wondering if there is a way to emulate an OpenCL capable GPU just like with CUDA. I’m on Ubuntu 10.10 32 Bit.

    Thanks in advance

  17. Iarei

    I’m having some trouble getting the Makefile to recognize the correct path. I am completely new to using them, and I think some of the syntax changes for Windows systems. I’ve tried variations of
    $LD_LIBRARY_PATH = $ENV{ProgramFiles(x86)}\ATIStream\lib
    to no avail. Any assistance would be appreciated.

  18. Jack

    Nice tutorial. I was wondering how large an array my GPU could handle. It produces errors when I make them 10 million long. I was wondering how would people process larger arrays.

  19. enj Post author

    @Iarei: Unfortunately I can’t tell you exactly what to do, but the first thing I would try would be to put the entire path to ATIStream in your LD_LIBRARY_PATH,
    something like C:\Program Files\…\ATIStream\lib

    @Jack: Glad you like it! You can find out the memory limit of your GPU with OpenCL calls; both NVIDIA and ATI have good example programs in their SDKs showing how to do this. From there you can calculate the memory your array would take (10 million * 4 bytes per float, so about 40 MB) and see if it would fit. For arrays that are too large, people usually break the problem up into several smaller problems and run those one at a time (or use multiple GPUs).
    When you start doing that you want to pay attention to other limits, like how fast the card can copy memory back and forth to the CPU, and how long each sub array takes to process.

  20. Jack

    Yesterday I had success with the Mac mini running Snow Leopard and a NVIDIA GeForce 320M.

    Today I am testing it on Windows 7 64-bit with an ATI Radeon HD 5700 Series. I have installed ati-stream-sdk-v2.3-vista-win7-64.exe
    Visual Studio C++ 2008 Express Edition, I compiled the sample code OpenCLSamples.sln
    All but SimpleDX10 compiled, I tested several of the sample opencl programs

    Now I am experiencing some difficulty trying Adventures in OpenCL: Part 1, Getting Started in this environment and I know I have not tried everything yet. I tried the cmake GUI native compiler visual studio 9 2008. I put the source and build as parent and child directories just as the terminal example on the mac.
    CMake module path: C:/Users/jshultz/Documents/Visual Studio 2008/Projects/adventures_in_opencl.git/part1/../cmake

    include directories:

    C:/Users/jshultz/Documents/Visual Studio 2008/Projects/adventures_in_opencl.git/part1/../opencl10

    Configuring done
    WARNING: Target “part1.x” requests linking to directory “C:\Users\jshultz\Documents\ATI Stream\lib\x86_64”. Targets may link only to libraries. CMake is dropping the item.
    WARNING: Target “part1.x” requests linking to directory “C:\Users\jshultz\Documents\ATI Stream\lib\x86_64”. Targets may link only to libraries. CMake is dropping the item.
    WARNING: Target “part1.x” requests linking to directory “C:\Users\jshultz\Documents\ATI Stream\lib\x86_64”. Targets may link only to libraries. CMake is dropping the item.
    WARNING: Target “part1.x” requests linking to directory “C:\Users\jshultz\Documents\ATI Stream\lib\x86_64”. Targets may link only to libraries. CMake is dropping the item.
    Generating done

    Build solution file in VS C++
    1>------ Build started: Project: part1, Configuration: Release Win32 ------
    1>cl : Command line warning D9002 : ignoring unknown option ‘-g’
    1>..\util.cpp(13) : warning C4996: ‘fopen’: This function or variable may be unsafe. Consider using fopen_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details.
    1> C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\include\stdio.h(237) : see declaration of ‘fopen’
    1>..\part1.cpp(69) : error C2057: expected constant expression
    1>..\part1.cpp(69) : error C2466: cannot allocate an array of constant size 0
    1>..\part1.cpp(69) : error C2133: ‘c_done’ : unknown size
    1>Generating Code…
    1>Build log was saved at “file://c:\Users\jshultz\Documents\Visual Studio 2008\Projects\adventures_in_opencl.git\part1\build\part1.dir\Release\BuildLog.htm”
    1>part1 - 3 error(s), 2 warning(s)
    2>------ Build started: Project: part1.x, Configuration: Release Win32 ------
    2>cl : Command line warning D9002 : ignoring unknown option ‘-g’
    2>LINK : fatal error LNK1181: cannot open input file ‘Release\part1.lib’
    2>Build log was saved at “file://c:\Users\jshultz\Documents\Visual Studio 2008\Projects\adventures_in_opencl.git\part1\build\part1.x.dir\Release\BuildLog.htm”
    2>part1.x - 1 error(s), 1 warning(s)
    3>------ Skipped Build: Project: ALL_BUILD, Configuration: Release Win32 ------
    3>Project not selected to build for this solution configuration
    ========== Build: 0 succeeded, 2 failed, 1 up-to-date, 1 skipped ==========

  21. Iarei

    @Jack: I’m about where you are in figuring this out. According to what I’m reading:
    “Statically allocated arrays need to have a known length at compile-time.” GCC apparently has some inherent way of handling this that VS lacks. So I guess you need to define NUM as a vector? It’s probably better if someone who knows what they’re talking about explains this.

  22. Pingback: Adventures in PyOpenCL: Part 1 Getting Started with Python | enj

  23. Krish

    i just started learning openCL
    my m/c is HP Z600 Workstation with Nvidia Quadro FX 3800
    i have installed nvida drivers & cuda computing sdk…..
    & all environmental vars. set…….
    OS: Fedora14 _ 64 Bit gcc: 4.5.2
    but wen i am compiling opencl progs it gives me error
    CL/cl_platform.h : No such file or directory ….

    pls guide me on how to rectify this………..

  24. Pingback: Intel’s OpenCL (Linux How-To) |

  25. Ben

    It is not working on my system, and I don’t know why. Can anyone help me please?
    This is the output I have seen.

    build$ ./part1.x
    Hello, OpenCL
    Initialize OpenCL object and context
    Available platforms:
    platform 0: ATI Stream
    selected platform: 0
    oclGetPlatformID: CL_SUCCESS
    clGetDeviceIDs (get number of devices): CL_DEVICE_NOT_FOUND
    clGetDeviceIDs (create device list): CL_INVALID_VALUE
    load the program
    path: /home/benjamin/workspace/enjalot-adventures_in_opencl-a5bb2a1/part1/
    clCreateProgramWithSource: CL_INVALID_CONTEXT
    building the program
    clBuildProgram: CL_INVALID_PROGRAM
    program built
    in popCorn
    clCreateKernel: CL_INVALID_PROGRAM
    Creating OpenCL arrays
    Pushing data to the GPU
    Segmentation fault

  26. John

    Thanks for the introduction. Part 1 works well on my Windows system, with one minor change: as noted by Iarei, you cannot use variable-length arrays in MS Visual C. This is a C99 feature that MSVC does not implement. Replace these lines:
    float c_done[num];
    err = clEnqueueReadBuffer(command_queue, cl_c, CL_TRUE, 0, sizeof(float) * num, &c_done, 0, NULL, &event);
    with these:
    std::vector<float> c_done(num);
    err = clEnqueueReadBuffer(command_queue, cl_c, CL_TRUE, 0, sizeof(float) * num, &(c_done[0]), 0, NULL, &event);
    and add #include <vector>
    [Ugly but it works]
    I’m using Visual Studio 2010, with CUDA 4.0.
    The -g compile flag is also not available, but it only generates a compile-time warning. You could consider wrapping it in IF(NOT MSVC).

  27. tester

    Help needed

    Just downloaded the example onto ubuntu system

    2.6.32-33-generic #72-Ubuntu SMP Fri Jul 29 21:07:13 UTC 2011 x86_64 GNU/Linux
    NVIDIA Quadro FX 3800
    Intel(R) Xeon(R) CPU
    gcc version 4.4.3

    Inside adventures_in_opencl/part1/build/ , after
    cmake ..

    I get
    Hello, OpenCL
    Initialize OpenCL object and context
    Error -1001 in clGetPlatformIDs Call !!!

    clGetDeviceIDs (get number of devices): CL_INVALID_PLATFORM
    clGetDeviceIDs (create device list): CL_INVALID_PLATFORM
    load the program
    path: /home/dbanks/www-home/src/adventures_in_opencl/part1/
    clCreateProgramWithSource: CL_INVALID_CONTEXT
    building the program
    clBuildProgram: CL_INVALID_PROGRAM
    program built
    in popCorn
    clCreateKernel: CL_INVALID_PROGRAM
    Creating OpenCL arrays
    Pushing data to the GPU
    zsh: segmentation fault ./part1.x

  28. Harini

    hey ,
    can u please help me out. am a student , and very new to openCL. my pc is having AMD processor with ATI graphic card , i have installed visual studio 10 and AMD APP KernelAnalyzer 1.9 , n included 10 header files given in kronous group into my include folder of vc. i am noting getting next steps to compile it , please tell me how should i debug, build ,compile n run my example programs. im using windows vista os. pls do reply need help badly

Comments are closed.