Particles in BGE: Fluid Surface and Collision in the works

Hello, World! Just keeping you guys up to date with what we are doing. I’d like to introduce someone who’s joining me in development of RTPS, Andrew Young. He’s a student at DSC interested in GPU computing and his current project is working on rendering techniques. Check us out:

[RTPS] Fluid Surface and Collisions in the works from Ian Johnson on Vimeo.

So this is kind of like a progress report with some pretty pictures; we still have a lot of work to do! Andrew is working on his blog, where he will explain in more detail where he is at in his implementation of Screen Space Fluid Rendering, and I will link to his post as soon as it’s up.

As for my progress, I’ve been slowed down by a couple of troubling bugs in the simulation as well as some idiosyncrasies of GPU programming. Things are a bit better now but I still have some hunting to do to find the root cause of the simulation bug, and there is some optimization and resource monitoring needed for OpenCL on older graphics cards.

At the same time I have been working on features like Triangle collision, which I consider half-way complete. The fluid particles collide with the triangles (obtained from the Blender object’s mesh, needs to be in triangle form, no quads) correctly, but the forces we use to influence the particles are not good enough. We have some ideas on how to improve this (involving placing “ghost” particles on the triangles) so hopefully we will have some pretty accurate collisions soon. Next on the list is improving the emitters so we can have a proper hose to spray fluid with different directions and initial velocities. I want to make either a custom object or modifier which will get rid of the awkward game properties settings I’m using now.

I have also been talking with another student in our group, Evan, about improving and expanding the OpenCL classes in RTPS for use in other libraries. This could potentially make life simpler for other developers, because we’ve already abstracted away a lot of the tedious details for stuff we use a lot (1D arrays, CL/GL context sharing) but we would need to properly engineer some other things (cl source code and kernel management, better context management and exposing more functionality).

I’m also quite excited about this proposal for an OpenCL based compositor. I’ve been learning a lot of OpenCL in the last months, but it will take a concerted effort of more developers to get projects like mine and the compositor ready for a consumer grade application like Blender.

Exciting times :)

Posted in blender, code, opencl | 6 Comments

sleeping on the edge, hope i don’t toss and turn

It was the beginning of 2007 when I first heard about Normalized Google Distance being used for automatic translation and I started to understand how powerful linear algebra could be. I couldn’t understand the paper but I knew about the 25 billion dollar eigenvector and that the best way to get involved in any of the new and interesting developments I was reading about would be to understand linear algebra.

It is now 2011 and I feel like I am involved in some exciting projects and looking forward to being part of many more. We have been simulating fluids in our lab and we just got two kinects which we will bend to our will. Both endeavors require intimate knowledge of point clouds, manipulating, visualizing, and analyzing them to our nefarious ends. In the very short time since I was introduced to CT scan data and expressed interest in CT related research it’s become apparent that there is a huge need for software that can deal with essentially a whole bunch of points. The mathematics behind the techniques to look at every aspect of the human body at once, to scan your brain, read your thoughts and drive your car all depend on linear algebra to bring them to life. The reason your phone can solve sudoku and understand you when you speak to it is all because we can solve systems of equations. We don’t solve word problems anymore, the computer does that, we just have to come up with the right questions and the eigenvalues do the heavy lifting.

Four years after the beginning of my journey into math I feel increasingly closer to the edge. When I attempted to read the Normalized Google Distance paper the symbols and vocabulary slowed me to a halt. Many classes and lots of reading later, I still lay no claim on understanding math, but I have been using techniques from papers as recent as one year ago. It won’t be long before a paper from last year, and then a paper under review, shows me something I needed to solve a problem. After that there will be problems in front of me which haven’t been worked on before.

So I can’t wait to see what my friends and like-minded internet acquaintances do. There are way too many interesting things to be involved in and not enough time!

Posted in life, math | 1 Comment

RTPS on my Android

A while back I posted about using the Android NDK to put particles on my android. Since then I’ve learned a good bit, made more progress on my primary project and I have a little something to show:

[RTPS] Android demo from Ian Johnson on Vimeo.

Sorry about the blurriness, my new camcorder doesn’t have a macro mode (but it is waterproof! won’t have to worry about my fluid simulations spilling out of the screen ;)

As part of the newly founded Tallahassee Android Developers group I did a presentation on using the NDK to code in C++ for Android. This presentation was done maybe a week before NDK-r5 came out, so the parts about using a custom tool-chain to get STL support are outdated, but I go over getting started with the NDK and discuss how I used it in this project. The video is kind of long, and mostly a reference for those in the group, but feel free to check it out. The actual information in the video is distilled into 5 slides which have been updated for NDK-r5.

Ok this is where things get technical, so if that ain’t your thing just close this window, pick up your droid and text your bff about me.

So I’m running Android 2.2 on my HTC Incredible. I’m using the Android NDK-r5, with the majority of my code as C++ (with STL, like std::vector and std::string) and OpenGL ES1.0

The interesting thing is that the code is only slightly modified from the CPU version of my RTPS project. They are still separate code bases, but I just copied the CPU version to a new folder and deleted every reference to OpenCL. If I was so inclined I could keep it as one codebase and use #defines with #ifdef guards to remove things at compile time, but I’d rather not complicate my thesis work and it’s actually a good mental workout to maintain both.
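For readers curious what the single-codebase route could look like, here is a minimal sketch of a compile-time switch. The flag name RTPS_USE_OPENCL and the function backendName are hypothetical and made up for this illustration; they are not taken from the RTPS source.

```cpp
#include <string>

// Hypothetical build flag: pass -DRTPS_USE_OPENCL to the compiler for the
// desktop build and leave it undefined for the Android build.
inline std::string backendName()
{
#ifdef RTPS_USE_OPENCL
    // Only compiled when the flag is defined, so OpenCL headers and calls
    // never reach a toolchain that lacks them.
    return "opencl";
#else
    // Fallback path that uses nothing but the CPU code.
    return "cpu";
#endif
}
```

The trade-off is the one described above: one tree to maintain, at the cost of guards scattered through the code.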

As I learn more about GLSL I’d like to start using OpenGL ES2.0 and shaders to make some cool effects. I’d like to do it in a cross platform way so that older phones can fall back on ES1.0 and the fixed pipeline. I also really need to make a Java interface so that you can switch between the Simple particles and the SPH, as well as change some parameters on the fly.

Posted in android, code | 1 Comment

Particles in BGE: Fluids in Real Time with OpenCL

Latest Update: Improved Code, Collisions and Hose

Gentlemen, behold! (Ladies too). At last I have something to show! My advisor and I have finally gotten to a point in our implementation of our Real Time Particle System that it can see the light of day. We have been working on an OpenCL implementation of the SPH method for simulating real time fluids. At this point it looks pretty good, which is the scientific way of saying there is still a lot of work to do. Take a look at the video to see it in action inside the Blender Game Engine!

Particles in BGE: Real-Time Fluids with OpenCL from Ian Johnson on Vimeo.

The rest of this post will be a quite technical discussion of what’s been done and what still needs to be done. If you just came for the demo and pretty pictures, reading any further puts you at risk of entering an extreme state of geekiness which can be quite disorienting and uncomfortable for those unaccustomed to the high.

The Project

So this project is currently split into two parts, the RTPS library and my modifications to the Blender Game Engine. The RTPS (Real Time Particle System) library has a standalone viewer for development and testing and serves as a limited example for the API. My Blender modifications are my hackish attempts to link against the library and provide an interface. One nice thing is that the library is compiled as a shared library meaning we can make modifications to it without recompiling Blender. I say hackish because I’m using a mix of a custom modifier and game properties to pass necessary information to the RTPS instance. Really I should have my own Particle System object, a separate Domain object and proper python hooks so that properties of the system could be dynamically changed by scripts and actuators. Luckily Blender has a great community and I’d like to shout out to Moguri for his help so far!

How Fast is Real Time?

So most gamers know about FPS (frames per second) being important to their experience, and the higher the number the better. A reasonable FPS for an interactive 3D game is 60fps, but 30fps can be deemed acceptable, and movies play back at 24fps. In the computational sciences we measure things in terms of milliseconds, so if you can do 60 frames per second it takes you 17ms to make one frame, or at 30fps it takes 33ms to make one frame. So we don’t have a lot of time to work with when building frames, not only do we need to compute the new position for all the fluid particles in that time, but also draw them and handle any of the rest of the stuff going on in the game engine!
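The frame-time arithmetic above can be written down as a small sketch. The function names are made up for illustration.

```cpp
// Time available to produce one frame, in milliseconds, at a target frame rate.
inline double frameBudgetMs(double framesPerSecond)
{
    return 1000.0 / framesPerSecond;
}

// True when a simulation update that takes updateMs still leaves time
// in the frame for rendering and the rest of the game engine.
inline bool fitsInFrame(double updateMs, double framesPerSecond)
{
    return updateMs < frameBudgetMs(framesPerSecond);
}
```

At 60fps the budget is 1000/60, about 16.7ms, and at 30fps it is about 33.3ms, which matches the rounded figures above.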

So how fast are we going now? First it’s important to talk about what machine we are running on. Currently I have timings for a ’09 MacBook Pro which has an NVIDIA GeForce 9400M, a Dell T7500 with an ATI FirePro V7800 and a Dell T7500 with an NVIDIA GTX480.

The MacBook Pro is using Apple’s OpenCL drivers, this having the least powerful GPU we expect it to be the slowest. The two Dells are running Ubuntu 10.04, the one with the ATI card is running the ATI Stream SDK 2.2 and the one with the NVIDIA card is running the 260.24 driver. We recently obtained the ATI card, and haven’t had enough time to test things thoroughly.

The following timings are for the update loop in the RTPS library (thus all of the calls to OpenCL to update the positions of the particles based on the SPH method). This does not include rendering.

Card                 4096 Particles   8192 Particles   16384 Particles
9400M                25ms             42ms             78ms
ATI FirePro V7800    7.5ms            8ms              8ms
NVIDIA GTX480        2ms              2.5ms            4.8ms

So right off the bat we see that the MacBook Pro is already too slow for desirable frame rates. The two powerful cards perform quite well however, and you may notice something strange about the ATI card, the timings stay the same!

One explanation that fits these timings well (besides the timings being completely wrong ;) is that the expensive part of this algorithm is the neighborhood calculation (updating each particle’s force by doing calculations on all the particles around it). This requires lots of memory lookups, which are expensive on the GPU. The 9400M doesn’t actually have onboard RAM so it has to go all the way to the CPU’s main memory to do those accesses. On the other end of the spectrum, the GTX480 has advanced caching mechanisms which handle our non-optimal implementation quite nicely. More testing is necessary on the ATI but it is my theory that the memory latency is completely hiding the computational cost of the neighbor routines.
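To make the cost of the neighborhood step concrete, here is a CPU sketch of the simplest possible version, where every particle looks at every other particle. This all-pairs loop is an assumption chosen for illustration; it is not the RTPS kernel, and the names Vec4 and countNeighbors are stand-ins.

```cpp
#include <cstddef>
#include <vector>

// Stand-in particle type: position in x, y, z (w unused here).
struct Vec4 { float x, y, z, w; };

// For each particle, count the other particles closer than `radius`.
// Each particle reads the position of every other particle, so the number
// of memory lookups grows with the square of the particle count.
inline std::vector<int> countNeighbors(const std::vector<Vec4>& pos, float radius)
{
    const float r2 = radius * radius;  // compare squared distances, no sqrt needed
    std::vector<int> counts(pos.size(), 0);
    for (std::size_t i = 0; i < pos.size(); ++i) {
        for (std::size_t j = 0; j < pos.size(); ++j) {
            if (i == j) continue;      // a particle is not its own neighbor
            const float dx = pos[i].x - pos[j].x;
            const float dy = pos[i].y - pos[j].y;
            const float dz = pos[i].z - pos[j].z;
            if (dx * dx + dy * dy + dz * dz < r2) ++counts[i];
        }
    }
    return counts;
}
```

In an SPH code the inner loop would accumulate density or force contributions instead of a count, but the memory access pattern is the same.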

We plan to implement more efficient data structures and algorithms to gain what we hope will be large speedups for older GPUs and perhaps the ATI as well.

As a word of warning, with our current setup I’ve been getting some nasty crashes with the ATI card, but we suspect it has something to do with all the duct-tape.

What’s Next?

My first priority is improving the UI for the RTPS library in Blender. This includes integrating more properly and making more parameters accessible both during setup and at runtime. I want to give Python access so that particles can be dynamically emitted in many more ways. Collision detection should be coming back soon too! My advisor is interested in increasing the efficiency and accuracy of the SPH implementation. One of my fellow students, Andrew Young, is working on a way to extract the surface so we can do some pretty rendering, and he is also interested in a multi-GPU implementation.

Thanks

I’d like to thank:

Posted in blender, opencl | 26 Comments

10.10 Ubuntu Release Party – Tallahassee, FL

On Saturday October 9th we celebrated the new Ubuntu version (as well as my birthday) with a big bash. My friend Nathan Crock and I know how fun Ubuntu is but we wanted everyone else to know too. The evening started with the end of the FSU vs. Miami game projected on the screen outside. 2 Kegs and a well stocked (and super hot) bartender made sure no one (21 or older) was thirsty. Inside we had a live band who went by “The Super Users” rocking out while outside Dj Ozzie played MVs on the projector. Beer pong was played and the Ubuntu Girls walked around serving shots and educating guests about the wonders of Ubuntu.


You’ll notice from the pictures that we had a nerd themed party, as well as a costume contest where winners won Ubuntu shirts. After the contest we got everyone’s attention just before midnight to say a few words about Ubuntu, sing happy birthday and count down to midnight.


After plenty of merriment it was time to make good on our Office Space promise to smash some printers! The crowd went wild as 5 of the contest winners smashed 5 printers to pieces. The winners practiced the “sharing” aspect of Ubuntu and let others take swings at the hardware.


Throughout the night we had about 200 people come through including some old, and hopefully many new Ubuntu fans. It was a lot of fun, and we hope to do it bigger in April! We hope to work closely with the Florida LoCo for the next release!

We also took 3D pictures with a camera from The Department of Scientific Computing at FSU that we will have to find a way to show off (a lot of the video above was shot with the FUJI 3D camera hence the relatively low resolution)

Posted in misc | 3 Comments

NVIDIA GTC 2010

NVIDIA GPU Technology Conference

This week I’m at the NVIDIA GPU Technology Conference (#GTC2010) learning as much as I can about GPU Computing! There are a lot of cool companies, interesting posters and thought provoking talks going on.

Of course I’ll be attending all the sessions about OpenCL optimization, and so far a few of the talks have mentioned OpenCL implementations in addition to their CUDA work. It sounds like people are getting pretty close performance with OpenCL, maybe only about 1.5x slower. The consensus seems to be that the hit is worth it for portability, but if you only use NVIDIA hardware you should stick with CUDA for the extra bit of performance.

The first day had a few pre-conference tutorials, including an intro to CUDA and an intro to OpenCL. I noticed two things, first that the OpenCL room seemed more full (probably because people attending GTC are more likely to already know CUDA). The second was that the CUDA presentation went under time by about 30 minutes, and the OpenCL presentation went about 30 minutes over schedule. It seems like the time needed to present an introduction linearly maps to the complexity of the language. A lot of people here are a bit pessimistic about OpenCL due mostly to its low level interface as opposed to the CUDA Runtime API as well as a lack of performance incentive if you already use an NVIDIA GPU (everyone here…). While they did list off all the ways they contributed to and are involved in OpenCL (one of their VPs chairs the board of the working group at Khronos) they are not shy about the fact that CUDA will remain far ahead of OpenCL in features and development. This makes sense since they directly control CUDA and have to submit to a committee for OpenCL, so it seems they will drag their heels as long as they don’t face real competition in GPGPU.

I remain optimistic about OpenCL in general, and this has less to do with GPUs and more with the reason for the standard in the first place: heterogeneous computing environments. Most importantly, mobile. There are over 5 billion wireless subscriptions worldwide, and NVIDIA’s CEO claims their mobile chip Tegra is a once-in-a-lifetime opportunity to be prepared for the future. They are using ARM because it is the most proven architecture for mobile, but so are a lot of other people in the mobile space. ARM is predicting OpenCL on all smartphones, so I think OpenCL will continue to gain momentum even if it isn’t as exciting on the GPU as CUDA. I hope that increased demand and competition from the market will encourage NVIDIA to port more of their CUDA features over to OpenCL (ahem, templates please).

There are still 2 days of talks, so there is much yet to see and learn. I’ve already met some really cool people while standing by my poster (7mb pdf) and hope to meet many more!

Real time particle simulation in the Blender game engine with OpenCL

Posted in community, culture, opencl | 1 Comment

Adventures in OpenCL Part 2: Particles with OpenGL

20,000 particles being shared by OpenGL and OpenCL

This tutorial series is aimed at developers trying to learn OpenCL from the bottom up, with a focus on practicality. This installment introduces OpenCL context sharing with OpenGL. We make a simple particle system to demonstrate this feature. One of the most important aspects of this feature is the time we can save by doing rendering and calculations on the same memory in the GPU, this means we don’t need to copy data back and forth!

You may want to grab the code and compile it to see it in action. As usual I recommend having a copy of the OpenCL specification handy.

In this tutorial I use the C++ bindings with less explanation, so see Part 1.5 for a more in-depth explanation or Part 1 to get started with C bindings. As before this code has been tested by me on NVIDIA hardware on my Macbook Pro and on an Ubuntu workstation, ATI and Windows users are encouraged to try it out and let me know if you have any problems so I can update the tutorial.

Sharing is Caring!

Let’s see what we need to do to get OpenCL and OpenGL sharing their context. The first thing we need is an OpenGL context! For this tutorial I use GLUT to create a window that we can draw in, and GLEW for OpenGL extensions (at least on Linux). This means you will need to have those headers and libraries installed on your system. Once you have those you can try building:

cd part2
mkdir build
cd build
cmake ..
make

The Source Code Files

Let’s go over the source files again, even though they are the same as the last tutorial we add more functionality in a couple of them, so it will be good to quickly go over the changes.

main.cpp
This is where we test out our CL class. We setup a GLUT window and OpenGL context. Then we instantiate our CL class, prepare a simple particle system and initialize it. At the end of the file are several helper functions for manipulating the OpenGL view with the mouse and keyboard.

cll.h
The main header file for our CL class definition, also handles including the OpenCL libraries. I’ve downloaded the header files from the Khronos website to avoid having to search the computer for a particular SDK. Note that I’ve had to make a slight change to cl.hpp for Mac users because of a bug in the implementation, which I will cover later.

cll.cpp
The core implementation of our CL class, including functions for initializing the OpenCL context from OpenGL, loading and building an OpenCL program.

part2.cpp
Implementation of the functions that setup and run the OpenCL kernel. This is where we actually see OpenCL in action.

part2.cl
The actual OpenCL code to be executed. Right now it’s a simple particle system that models gravity.

util.h and util.cpp
Utility functions that make things like creating VBOs or printing out OpenCL error messages easier

CMakeLists.txt
The configuration and build script used to build the project. This makes it easier to be portable, and building our code as a library makes it easier to contribute to other projects.

The Source Code Contents

So our main.cpp is a bit messier than before, this is mostly just setting up the OpenGL stuff. You may already have some OpenGL context to plug your OpenCL stuff into, in which case you just want to make sure you are using Vertex Buffer Objects (VBOs) that OpenCL can use to create its buffers.
I will just point out some interesting bits that are important to OpenCL, starting with the number of particles and a pointer to our CL class:

#define NUM_PARTICLES 10000
CL* example;

and later we instantiate it like:

 example = new CL();

We do this so that we can access the object from the GLUT loop, specifically in the appRender function which is called by GLUT to update the display. Before we can talk about rendering though, we need to load some data! Take a look at the for loop where we populate three vectors of Vec4 which is just a typedef for 4 floats; x, y, z, w, declared in cll.h. We then pass these to our class which will push the data to the GPU.

example->loadData(pos, vel, color);
example->popCorn();
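For readers without the source open, the kind of loop that fills the three arrays might look like the sketch below. The Vec4 struct is a stand-in for the one declared in cll.h, and the layout chosen here (particles on a line, at rest, colored red) is invented for illustration; the tutorial's own loop may use different values.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the Vec4 declared in cll.h: four floats x, y, z, w.
struct Vec4 { float x, y, z, w; };

// The three arrays handed to loadData: one entry per particle in each.
struct ParticleData {
    std::vector<Vec4> pos, vel, color;
};

// Fill three equally sized arrays on the host before pushing them to the GPU.
inline ParticleData makeParticles(std::size_t count)
{
    ParticleData d;
    d.pos.reserve(count);
    d.vel.reserve(count);
    d.color.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        // Spread particles evenly along x in [0, 1].
        const float t = count > 1 ? float(i) / float(count - 1) : 0.0f;
        d.pos.push_back(Vec4{t, 0.0f, 0.0f, 1.0f});
        d.vel.push_back(Vec4{0.0f, 0.0f, 0.0f, 0.0f});    // start at rest
        d.color.push_back(Vec4{1.0f, 0.0f, 0.0f, 1.0f});  // opaque red
    }
    return d;
}
```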

Lastly in main.cpp let’s take a look at the void appRender() function, where the first thing we do is update the particle system:

example->runKernel();

This is followed by the rendering code, which is standard OpenGL for drawing points from a VBO. I learned this from here.

Now let’s take a look at some changes to cll.cpp.
It is easier to just click the link and view the whole source file, since the code is too long for a snippet here, but the major addition is the way we create the context. Each operating system uses different extensions to accomplish the same thing. The code mostly comes from the NVIDIA GPU SDK examples, but I modified it to use the C++ bindings. In doing so I found an inconsistency in Apple’s implementation and had to add an extra constructor to the Context class in cl.hpp (around line 1448) to compensate.

We see the most OpenCL action in part2.cpp which also demonstrates creating a CL buffer from a GL buffer. First we create the GL buffer as a VBO

p_vbo = createVBO(&pos[0], array_size, GL_ARRAY_BUFFER, GL_DYNAMIC_DRAW);

The nice thing about std::vectors is that they store their elements in a tightly packed array so we can just pass the address of the first element. Next we store our vbos in another vector as cl::BufferGL objects:

cl_vbos.push_back(cl::BufferGL(context, CL_MEM_READ_WRITE, p_vbo, &err));

We don’t need to push any data to them like we do our pure OpenCL buffers because they just reference the data that is already in the VBO!
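The point about std::vector storing its elements in one tightly packed array can be checked directly. The sketch below uses a stand-in Vec4 and two made-up helper names. A byte count like the one byteSize returns is what a buffer upload generally needs; whether array_size in the call above is computed this way is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the Vec4 declared in cll.h.
struct Vec4 { float x, y, z, w; };

// True when element i sits exactly i * sizeof(Vec4) bytes after element 0,
// which is what makes passing &v[0] to a C-style API safe.
inline bool isTightlyPacked(const std::vector<Vec4>& v)
{
    if (v.empty()) return true;  // nothing to check, and &v[0] would be invalid
    const char* base = reinterpret_cast<const char*>(&v[0]);
    for (std::size_t i = 0; i < v.size(); ++i) {
        const char* element = reinterpret_cast<const char*>(&v[i]);
        if (element != base + i * sizeof(Vec4)) return false;
    }
    return true;
}

// Size of the whole array in bytes.
inline std::size_t byteSize(const std::vector<Vec4>& v)
{
    return v.size() * sizeof(Vec4);
}
```

Note that &v[0] is only valid for a non-empty vector, so an empty particle set needs a guard before any upload.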

Our popCorn function just loads the kernel like before, and sets its arguments. Finally in the runKernel function we have to do a couple extra things to work with our VBOs, namely

err = queue.enqueueAcquireGLObjects(&cl_vbos, NULL, &event);

and

err = queue.enqueueReleaseGLObjects(&cl_vbos, NULL, &event);

Which acquires the buffers before we execute the kernel, and then releases them when we are done. This way we can safely work on the data without interfering with OpenGL or it interfering with our OpenCL.

The last thing to talk about is the cl code itself, but I’ve heavily commented the code so it will be easier to just go read the file :)
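As a rough CPU reference for what a gravity particle update does, here is a minimal sketch using explicit Euler integration. It is not the contents of part2.cl; the function name, the gravity value and the choice of y as the vertical axis are assumptions.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the Vec4 declared in cll.h.
struct Vec4 { float x, y, z, w; };

// Advance every particle by one time step dt under constant gravity.
// On the GPU each loop iteration would be handled by one work-item.
inline void stepGravity(std::vector<Vec4>& pos, std::vector<Vec4>& vel, float dt)
{
    const float g = -9.8f;  // assumed: gravity pulls along negative y
    for (std::size_t i = 0; i < pos.size() && i < vel.size(); ++i) {
        vel[i].y += g * dt;          // update velocity first
        pos[i].x += vel[i].x * dt;   // then move with the new velocity
        pos[i].y += vel[i].y * dt;
        pos[i].z += vel[i].z * dt;
    }
}
```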

I hope this tutorial helps, please let me know if I’m mistaken anywhere so I can correct it for others reading! I’m learning a lot about particle systems for my master’s thesis at the Florida State University Department of Scientific Computing!

Posted in advcl, code, opencl, tutorial | 12 Comments

Particles in BGE Update: Collisions

We have collisions!
Check out the video to see how it looks

This is a preview of collisions working in our OpenCL Particle System addition to the Blender Game Engine. We can collide against triangle meshes of objects that are marked as colliders. The particle system and the objects can be manipulated through the logic bricks in real-time. The particles are also now rendered using global coordinates so that moving and rotating the emitter does not move the whole system.

On a Macbook Pro we can collide 1000 particles against 1000 triangles in a scene at about 50fps. My DSC advisor Gordon wrote another advancement that uses bounding boxes to optimize this, and with it we should see about a 5x speedup. I just need to implement it in Blender! Also, if one uses a lot fewer triangles (or carefully chooses the meshes to collide against), 60fps can be maintained. For now I’m showing “worst case” performance until we tune it and add more options to compare it to (bounding box, bounding sphere). On the GTX 480 (Fermi architecture) running on Ubuntu we can do 65k particles against the same 1000 triangles at 60fps. The fps doesn’t dip below 60 until about 100k particles.
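The post does not describe how the bounding-box optimization works, so the following is only a generic sketch of the idea: test a particle against a cheap axis-aligned box around a triangle first, and run the full triangle test only when the box test passes. All type and function names are made up for illustration.

```cpp
#include <algorithm>

struct Vec3 { float x, y, z; };
struct Triangle { Vec3 a, b, c; };
struct Aabb { Vec3 lo, hi; };  // axis-aligned bounding box

// Smallest box containing the triangle, grown by `margin` on every side
// (for example, the particle radius).
inline Aabb boundsOf(const Triangle& t, float margin)
{
    Aabb box;
    box.lo.x = std::min(t.a.x, std::min(t.b.x, t.c.x)) - margin;
    box.lo.y = std::min(t.a.y, std::min(t.b.y, t.c.y)) - margin;
    box.lo.z = std::min(t.a.z, std::min(t.b.z, t.c.z)) - margin;
    box.hi.x = std::max(t.a.x, std::max(t.b.x, t.c.x)) + margin;
    box.hi.y = std::max(t.a.y, std::max(t.b.y, t.c.y)) + margin;
    box.hi.z = std::max(t.a.z, std::max(t.b.z, t.c.z)) + margin;
    return box;
}

// Cheap rejection test: a particle outside the box cannot touch the triangle,
// so the expensive particle-triangle test can be skipped.
inline bool mightCollide(const Aabb& box, const Vec3& p)
{
    return p.x >= box.lo.x && p.x <= box.hi.x &&
           p.y >= box.lo.y && p.y <= box.hi.y &&
           p.z >= box.lo.z && p.z <= box.hi.z;
}
```

The boxes only need to be rebuilt when a collider moves, so for static meshes the extra cost per frame is six comparisons per particle-triangle pair.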

Now that we see it working, I have a lot of cleaning up and benchmarking to do. Stay tuned for a more complete writeup!

Posted in misc | 13 Comments

Particles in BGE Update: GLSL, more options

So I’ve been making some headway with my OpenCL Particle Systems in the Blender Game Engine. I now have the option to render the particles with GLSL (Hi moguri, I’m going to need more help!), and I have improved my modifier to give UI access to several parameters, including which system of the few I’ve implemented you want to use.
Check out the youtube screencast:

I also preview my initial attempts at using images (textures) with the particle system. My example is pretty crude, I just load in an image with OpenCV (there are other ways, but I want to learn OpenCV too) and generate particles in a grid with the color of each particle set to the pixel values of the image. The next step is to use actual OpenGL textures and then of course interface with Blender materials.
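The image-to-particles step can be sketched without OpenCV by starting from a raw pixel buffer. The sketch assumes tightly packed 8-bit RGB in row-major order (OpenCV itself loads images as BGR by default, so channels would need swapping there); the names ParticleGrid and particlesFromImage are hypothetical.

```cpp
#include <cstddef>
#include <vector>

struct Vec4 { float x, y, z, w; };

struct ParticleGrid {
    std::vector<Vec4> pos, color;
};

// Create one particle per pixel, laid out on a grid in the xy plane, with the
// particle color taken from the pixel.
// pixels: width * height * 3 bytes, RGB, row-major (an assumed layout).
inline ParticleGrid particlesFromImage(const std::vector<unsigned char>& pixels,
                                       std::size_t width, std::size_t height,
                                       float spacing)
{
    ParticleGrid g;
    if (pixels.size() < width * height * 3) return g;  // refuse short buffers
    for (std::size_t row = 0; row < height; ++row) {
        for (std::size_t col = 0; col < width; ++col) {
            const unsigned char* px = &pixels[(row * width + col) * 3];
            g.pos.push_back(Vec4{float(col) * spacing, float(row) * spacing, 0.0f, 1.0f});
            // Scale 0..255 channel values to the 0..1 range OpenGL expects.
            g.color.push_back(Vec4{px[0] / 255.0f, px[1] / 255.0f, px[2] / 255.0f, 1.0f});
        }
    }
    return g;
}
```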

I still have lots of work ahead of me, I want to make interaction with the system only affect generators (this means dealing with world coordinates instead of local for manipulation and rendering), collision is high up on the list of things to do, and now that I have GLSL working I need to learn how to use it to make some cool and efficient effects! In addition I’m starting to learn SPH which I’m really looking forward to!

I want to shout out to the Department of Scientific Computing for supporting this research, and check out our Intro to Game Design course we are offering again this fall!

Posted in blender, code, opencl | 15 Comments

Adventures in OpenCL: Part 1.5, C++ Bindings

This tutorial series is aimed at developers trying to learn OpenCL from the bottom up, with a focus on practicality. This part is a reworking of my first tutorial using the OpenCL C++ Bindings. Learning by example works best for me so make sure to get the code! It can only help you to have a copy of the OpenCL specification handy, and it doesn’t hurt (too bad) to read it!

My code works for me on my Macbook Pro (with Geforce 9400M) running Snow Leopard with the NVIDIA GPU SDK as well as on the Ubuntu 10.04 workstations (with GTX 480 or Geforce 8800GTX). Unfortunately I haven’t spent any time developing on Windows so for now my tutorials will be UNIX centric (I would LOVE any help in setting up a Windows environment, I’ll need to eventually for my Blender project). Also the code should build against the ATI Stream SDK and run on the runtime, you can assume I’m complying with OpenCL 1.0 (1.1 may be covered in the more advanced topics, and I’ll point it out). Please let me know if you have build problems or device issues! Throughout the tutorial I will refer to device and GPU interchangeably, OpenCL can run on CPUs already and is targeting many other devices, but for now I’m assuming a GPGPU bias.

Let’s get started!

You’ll need to have installed:

NVIDIA or ATI GPU SDK and OpenCL enabled drivers
(Ubuntu ATI users might like extra guidance)
CMake (introduction and in-depth tutorial [pdf])
and it helps to have Git (introduction and nice book)

Download the code and for the rest of the tutorial I will refer to the directory it’s in as advcl

In the advcl directory you should have the following directories:

part1.5/     //the source code files for this tutorial
cmake/       //CMake scripts that help locate necessary libraries
opencl10/    //OpenCL 1.0 header files (downloaded from Khronos.org)
opencl11/    //OpenCL 1.1 header files (downloaded from Khronos.org)

First we will build the code to make sure it works, I like to do an “out of source” build like so:

cd part1.5
mkdir build
cd build
cmake ..
make

This will generate all of the build files, the Makefile, the library and executable in the build directory which avoids cluttering up your source code directory. You can run the example like so:

./part1.x

The Source Code Files

Let’s first have a broad overview of each of the source files and then we can dive in and look at what the code is doing. I’ve set the code up to be a library with a CL class that can be instantiated and utilized anywhere. For this tutorial it is not very generalized so that it’s easier to see what’s going on behind the scenes. In future tutorials we will refactor and make our library more powerful!

main.cpp
This is where we test out our CL class. We instantiate it, give it an opencl program to compile and run, then execute the kernel.

cll.h
The main header file for our CL class definition, also handles including the OpenCL libraries. I’ve downloaded the header files from the Khronos website to avoid having to search the computer for a particular SDK.

cll.cpp
The core implementation of our CL class, including functions for initializing the OpenCL context, loading and building an OpenCL program.

part1.cpp
Implementation of the functions that setup and run the OpenCL kernel. This is where we actually see OpenCL in action.

part1.cl
The actual OpenCL code to be executed. Right now it’s a simple kernel that adds two arrays and stores the result in a third.

util.h and util.cpp
Utility functions that make things like reading files or printing out OpenCL error messages easier

CMakeLists.txt
The configuration and build script used to build the project. This makes it easier to be portable, and building our code as a library makes it easier to contribute to other projects.
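As a CPU reference for what the kernel in part1.cl is described as doing (adding two arrays and storing the result in a third), here is a minimal sketch. The function name is made up, and truncating to the shorter input is a choice made for this sketch only.

```cpp
#include <cstddef>
#include <vector>

// Element-wise sum of two arrays. On the GPU, each index i would be handled
// by its own work-item instead of a loop iteration.
inline std::vector<float> addArrays(const std::vector<float>& a,
                                    const std::vector<float>& b)
{
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    std::vector<float> c(n);
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
    return c;
}
```

Comparing the GPU result against a loop like this is a simple way to confirm the OpenCL setup works end to end.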

The Source Code Contents

Let’s follow the execution of the main function to see what order things need to be done in. Of course we need to include our library’s definitions:

#include "cll.h"

This includes the OpenCL headers as well as defines our CL class. If you look there you will see a few public member objects of type cl::Buffer which will point to arrays on our device. The private cl::* member objects are key OpenCL objects which we will use in the constructor to setup OpenCL for execution.
The constructor is defined in cll.cpp where it does a few things: sets the platform, sets the device to use, creates the OpenCL context and a command queue. Let’s see how it’s done:

    std::vector<cl::Platform> platforms;
    err = cl::Platform::get(&platforms);  // keep the status so the next line reports this call
    printf("cl::Platform::get(): %s\n", oclErrorString(err));

This is the first example of the nice cl wrapper class specified in the C++ bindings where we get the device Platforms. As you can see it is much cleaner than the previous code. The 3rd line is something you will see peppered throughout the code, oclErrorString(err) is a useful helper function I lifted from the NVIDIA SDK that gives you a little more info about what went wrong if you have a problem. You can get some more explanation of the different error types if you go to the Khronos specification for the function that errored, for example clGetDeviceIDs (a quick Google search for the function name generally returns the Khronos page first).
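To give a feel for what a helper like oclErrorString does, here is a minimal sketch that maps a handful of OpenCL status codes to their names. It is not the NVIDIA SDK function; the name clStatusName is made up, only a few codes are covered, and the numeric values are the ones defined in the Khronos cl.h header.

```cpp
#include <string>

// Map a few OpenCL status codes to readable names. The numbers are written
// out so the sketch compiles without the OpenCL headers installed.
inline std::string clStatusName(int status)
{
    switch (status) {
        case 0:   return "CL_SUCCESS";
        case -1:  return "CL_DEVICE_NOT_FOUND";
        case -2:  return "CL_DEVICE_NOT_AVAILABLE";
        case -3:  return "CL_COMPILER_NOT_AVAILABLE";
        case -5:  return "CL_OUT_OF_RESOURCES";
        case -6:  return "CL_OUT_OF_HOST_MEMORY";
        case -11: return "CL_BUILD_PROGRAM_FAILURE";
        case -30: return "CL_INVALID_VALUE";
        case -32: return "CL_INVALID_PLATFORM";
        case -33: return "CL_INVALID_DEVICE";
        case -34: return "CL_INVALID_CONTEXT";
        default:  return "unknown OpenCL status";
    }
}
```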

After we set the platform (for now we just select the first platform in the list), we use it to create a context. For this tutorial we choose the GPU to be the device. This could of course become more sophisticated with multiple graphics cards or other devices, and in another tutorial we will discuss checking for device capabilities.

    cl_context_properties properties[] =
        { CL_CONTEXT_PLATFORM, (cl_context_properties)(platforms[0])(), 0};
    context = cl::Context(CL_DEVICE_TYPE_GPU, properties);
    devices = context.getInfo<CL_CONTEXT_DEVICES>();

The last action of the constructor is to create the command queue:

    deviceUsed = 0;
    try{
        queue = cl::CommandQueue(context, devices[deviceUsed], 0, &err);
    }
    catch (const cl::Error& er) {
        printf("ERROR: %s(%d)\n", er.what(), er.err());
    }

Here you see some use of the exceptions provided by cl.hpp. Note that it is necessary to set

#define __CL_ENABLE_EXCEPTIONS

as I did at the top of cll.h, before cl.hpp is included.

Once the constructor is finished, we want to load our OpenCL program, so in main.cpp we do:

#include "part1.cl" //defines kernel_source
example.loadProgram(kernel_source);

You may wonder why we have an include here. I find it a nice way to bundle our OpenCL source with our library using a neat little macro trick.
loadProgram is defined in cll.cpp:

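    //copy the source text so we can pass its length in bytes
    std::string src(kernel_source);
    size_t pl = src.size();
    cl::Program::Sources source(1,
        std::make_pair(src.c_str(), pl));
    program = cl::Program(context, source);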

Once the program is created, it must also be compiled. It also helps to have some compiler output in case we have syntax errors (never!).

    err = program.build(devices);
    printf("program.build: %s\n", oclErrorString(err));
    if(err != CL_SUCCESS){
        std::cout << "Build Status: " << program.getBuildInfo<CL_PROGRAM_BUILD_STATUS>(devices[0]) << std::endl;
        std::cout << "Build Options:\t" << program.getBuildInfo<CL_PROGRAM_BUILD_OPTIONS>(devices[0]) << std::endl;
        std::cout << "Build Log:\t " << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0]) << std::endl;
    }

Once the program is loaded and built we are ready to pass the data to our device and do some computing! Let’s take a quick look at our simple kernel in part1.cl so we have an idea of what we are trying to do:

__kernel void part1(__global float* a, __global float* b, __global float* c)
{
    unsigned int i = get_global_id(0);
    c[i] = a[i] + b[i];
}

Since I’m still forming my understanding of OpenCL I’ll avoid explaining the details (you and me both gotta read the spec! or check out some of NVIDIA’s nice webinars). This serves as a practical starting point for you to write your own kernels, so let’s see what I’m doing here.
I define my kernel as part1 and it takes in three parameters. The __global keyword says what kind of device memory our input is stored in; besides that, we should already be familiar with passing arrays to functions as pointers.

One way to think of the kernel is as a replacement for a for loop (you can do much cooler things, but we gotta start somewhere). We get the index in the for loop from the get_global_id built-in function, and then we do our operation on the arrays at that index. We let OpenCL split up the arrays into work-items and it will try to do as many of them as possible in parallel.
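To make the for loop comparison concrete, here is a host-only sketch of the same computation in plain C++. The names part1Body and addOnHost are made up for illustration and are not part of the tutorial code; the point is just that the kernel body is what sits inside the loop, and the loop counter plays the role of get_global_id(0).

```cpp
#include <cstddef>
#include <vector>

// Body of the kernel for a single index. On the device, i would come
// from get_global_id(0) instead of a loop counter.
void part1Body(std::size_t i, const float* a, const float* b, float* c)
{
    c[i] = a[i] + b[i];
}

// Serial host version: this loop is what the kernel launch replaces.
// Assumes b has at least as many elements as a.
std::vector<float> addOnHost(const std::vector<float>& a,
                             const std::vector<float>& b)
{
    std::vector<float> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        part1Body(i, a.data(), b.data(), c.data());
    }
    return c;
}
```

On the device there is no loop: each work-item runs the body once with its own index.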

So how do we give OpenCL our arrays and tell it to do the work? That’s all in part1.cpp.
We set up our kernel with the popCorn function (hehe).
First we create the kernel object (with error checking):

    try{
        kernel = cl::Kernel(program, "part1", &err);
    }
    catch (const cl::Error& er) {
        printf("ERROR: %s(%d)\n", er.what(), er.err());
    }

Notice that the string we pass in is the name of the kernel as we defined it in the .cl file.
Then we set up the arrays we want to work on; I created a trivial example, as you can see in the file. The important part is creating the OpenCL buffers and pushing the data to the device:

    size_t array_size = sizeof(float) * num;
    //our input arrays
    cl_a = cl::Buffer(context, CL_MEM_READ_ONLY, array_size, NULL, &err);
    cl_b = cl::Buffer(context, CL_MEM_READ_ONLY, array_size, NULL, &err);
    //our output array
    cl_c = cl::Buffer(context, CL_MEM_WRITE_ONLY, array_size, NULL, &err);

Note that our inputs are created as read-only buffers and our output as write-only; there is also CL_MEM_READ_WRITE. These flags tell OpenCL how the kernel will access each buffer, so it can optimize for performance.
Pushing the data is the same for each one:

err = queue.enqueueWriteBuffer(cl_a, CL_TRUE, 0, array_size, a, NULL, &event);

Then we tell the kernel which buffers correspond to which arguments:

    err = kernel.setArg(0, cl_a);
    err = kernel.setArg(1, cl_b);
    err = kernel.setArg(2, cl_c);

You can also pass in other types of arguments, but we will see that in the next installment.
Next we introduce

queue.finish();

which makes sure that all commands in the queue have finished executing before the program continues. This will be more important in later tutorials where we run our kernel in a loop as fast as possible.
For now we just run it once in the runKernel() function:

err = queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(num), cl::NullRange, NULL, &event);

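The cl::NDRange(num) is the global work size (the total number of work-items), and it is one-dimensional like our arrays. Passing cl::NullRange for the local size lets OpenCL choose the work-group size.
Finally, we read from our device memory to see if the c array got updated the way we expected!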

//host storage for the result; the read needs allocated memory to copy into
std::vector<float> c_done(num);
err = queue.enqueueReadBuffer(cl_c, CL_TRUE, 0, sizeof(float) * num, &c_done[0], NULL, &event);

Now we can simply print out the contents of the c_done array:

for(int i=0; i < num; i++)
{
    printf("c_done[%d] = %g\n", i, c_done[i]);
}
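Printing works for a handful of values, but for bigger arrays it is easier to let the host check the result. Here is a small sketch of that idea; countMismatches is a hypothetical helper (not in the tutorial source), and it assumes the inputs and the result read back from the device are held in std::vector<float> on the host.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Counts the entries where the device result differs from a[i] + b[i]
// by more than tol. Returns 0 when every element matches.
// Assumes a and b have at least as many elements as c_done.
std::size_t countMismatches(const std::vector<float>& a,
                            const std::vector<float>& b,
                            const std::vector<float>& c_done,
                            float tol)
{
    std::size_t bad = 0;
    for (std::size_t i = 0; i < c_done.size(); ++i) {
        if (std::fabs(c_done[i] - (a[i] + b[i])) > tol) {
            ++bad;
        }
    }
    return bad;
}
```

If this returns anything other than 0, the first things to check are the buffer sizes and the order of the setArg calls.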

Hopefully the code and this walk-through give you a good starting point to learn more. Expect more tutorials, with the next one involving the OpenCL Profiler and determining device information, followed by OpenCL/OpenGL context sharing.
I definitely appreciate any feedback!
