Particles in BGE Update: GLSL, more options

So I’ve been making some headway with my OpenCL Particle Systems in the Blender Game Engine. I now have the option to render the particles with GLSL (Hi moguri, I’m going to need more help!), and I have improved my modifier to give UI access to several parameters, including which of the few systems I’ve implemented you want to use.
Check out the YouTube screencast:

I also preview my initial attempts at using images (textures) with the particle system. My example is pretty crude: I just load an image with OpenCV (there are other ways, but I want to learn OpenCV too) and generate particles in a grid, with the color of each particle set to the corresponding pixel value of the image. The next step is to use actual OpenGL textures and then, of course, interface with Blender materials.
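The grid idea is simple enough to sketch without OpenCV. The following C++ is a minimal sketch under my own assumptions (the Image and Float4 structs, the particlesFromImage name and the spacing parameter are all hypothetical, not from the project code): one particle per pixel, placed on a regular grid in the z = 0 plane and colored by that pixel.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a loaded 8-bit RGB image, stored row-major
// with 3 bytes per pixel (no OpenCV dependency, so it compiles anywhere).
struct Image {
    int width;
    int height;
    std::vector<std::uint8_t> rgb;  // size = width * height * 3
};

// Same layout as an OpenCL float4: x, y, z, w.
struct Float4 {
    float x, y, z, w;
};

// One particle per pixel: positions on a grid in the z = 0 plane, spaced
// `spacing` apart, and colors taken from the pixel, scaled to [0, 1].
void particlesFromImage(const Image& img, float spacing,
                        std::vector<Float4>& positions,
                        std::vector<Float4>& colors) {
    positions.clear();
    colors.clear();
    for (int row = 0; row < img.height; ++row) {
        for (int col = 0; col < img.width; ++col) {
            // Index of the pixel's first (red) byte.
            const std::size_t p =
                (static_cast<std::size_t>(row) * img.width + col) * 3;
            positions.push_back(Float4{col * spacing, row * spacing, 0.0f, 1.0f});
            colors.push_back(Float4{img.rgb[p] / 255.0f,
                                    img.rgb[p + 1] / 255.0f,
                                    img.rgb[p + 2] / 255.0f,
                                    1.0f});  // fully opaque
        }
    }
}
```

If the pixels come from OpenCV's cv::imread, keep in mind that it stores color channels in BGR order, so the first and third channels would need to be swapped.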

I still have lots of work ahead of me. I want interaction with the system to affect only the generators (which means using world coordinates instead of local ones for manipulation and rendering). Collision is high on my to-do list, and now that I have GLSL working I need to learn how to use it to make some cool and efficient effects! In addition, I’m starting to learn SPH (Smoothed Particle Hydrodynamics), which I’m really looking forward to!

I want to give a shout-out to the Department of Scientific Computing for supporting this research. Also, check out our Intro to Game Design course, which we are offering again this fall!

Posted in blender, code, opencl | 6 Comments

Adventures in OpenCL: Part 1.5, C++ Bindings

This tutorial series is aimed at developers trying to learn OpenCL from the bottom up, with a focus on practicality. This part is a reworking of my first tutorial using the OpenCL C++ Bindings. Learning by example works best for me, so make sure to get the code! It can only help to have a copy of the OpenCL specification handy, and it doesn’t hurt (too badly) to read it!

My code works for me on my MacBook Pro (with GeForce 9400M) running Snow Leopard with the NVIDIA GPU SDK, as well as on the Ubuntu 10.04 workstations (with GTX 480 or GeForce 8800GTX). Unfortunately I haven’t spent any time developing on Windows, so for now my tutorials will be UNIX-centric (I would LOVE any help setting up a Windows environment; I’ll need one eventually for my Blender project). The code should also build against the ATI Stream SDK and run on its runtime. You can assume I’m complying with OpenCL 1.0 (1.1 may be covered in the more advanced topics, and I’ll point it out when it is). Please let me know if you have build problems or device issues! Throughout the tutorial I will use device and GPU interchangeably: OpenCL can already run on CPUs and is targeting many other devices, but for now I’m assuming a GPGPU bias.

Let’s get started!

You’ll need to have installed:

NVIDIA or ATI GPU SDK and OpenCL-enabled drivers
CMake (introduction and in-depth tutorial [pdf])
and it helps to have Git (introduction and nice book)

Download the code. For the rest of the tutorial I will refer to the directory it’s in as advcl.

In the advcl directory you should have the following directories:

part1.5/     //the source code files for this tutorial
cmake/       //CMake scripts that help locate necessary libraries
opencl10/    //OpenCL 1.0 header files (downloaded from Khronos.org)
opencl11/    //OpenCL 1.1 header files (downloaded from Khronos.org)

First we will build the code to make sure it works. I like to do an “out of source” build like so:

cd part1.5
mkdir build
cd build
cmake ..
make

This generates all of the build files (the Makefile, the library and the executable) in the build directory, which avoids cluttering up your source code directory. You can run the example like so:

./part1.x

The Source Code Files

Let’s first have a broad overview of each of the source files, and then we can dive in and look at what the code is doing. I’ve set the code up as a library with a CL class that can be instantiated and used anywhere. For this tutorial it is deliberately not very general, so that it’s easier to see what’s going on behind the scenes. In future tutorials we will refactor it and make our library more powerful!

main.cpp
This is where we test out our CL class. We instantiate it, give it an OpenCL program to compile, then execute the kernel.

cll.h
The main header file for our CL class definition; it also handles including the OpenCL headers. I’ve downloaded the header files from the Khronos website to avoid having to search the computer for a particular SDK.

cll.cpp
The core implementation of our CL class, including functions for initializing the OpenCL context, loading and building an OpenCL program.

part1.cpp
Implementation of the functions that set up and run the OpenCL kernel. This is where we actually see OpenCL in action.

part1.cl
The actual OpenCL code to be executed. Right now it’s a simple kernel that adds two arrays and stores the result in a third.

util.h and util.cpp
Utility functions that make things like reading files or printing out OpenCL error messages easier.

CMakeLists.txt
The configuration and build script used to build the project. It makes the project easier to port, and building our code as a library makes it easier to contribute to other projects.

The Source Code Contents

Let’s follow the execution of the main function to see what order things need to be done in. Of course we need to include our library’s definitions:

#include "cll.h"

This includes the OpenCL headers and defines our CL class. If you look there you will see a few public member objects of type cl::Buffer, which will point to arrays on our device. The private cl::* member objects are key OpenCL objects which we will use in the constructor to set up OpenCL for execution.
The constructor is defined in cll.cpp, where it does a few things: it sets the platform, sets the device to use, and creates the OpenCL context and a command queue. Let’s see how it’s done:

    std::vector<cl::Platform> platforms;
    err = cl::Platform::get(&platforms);
    printf("cl::Platform::get(): %s\n", oclErrorString(err));

This is our first use of the wrapper classes from the C++ bindings: cl::Platform::get() fills a vector with the available platforms. As you can see it is much cleaner than the previous code. The third line is something you will see peppered throughout the code: oclErrorString(err) is a useful helper function I lifted from the NVIDIA SDK that gives you a little more info about what went wrong if you have a problem. You can get more explanation of the different error codes from the Khronos specification page for the function that failed, for example clGetDeviceIDs (a quick Google search for the function name generally returns the Khronos page first).
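To give an idea of what such a helper does, here is a minimal sketch of my own (clErrorName is a hypothetical name, not NVIDIA's oclErrorString). It maps an error code to its name with a switch; the numeric values are copied from the OpenCL 1.0 cl.h so the sketch compiles without the OpenCL headers, but real code should include the header and use the named constants.

```cpp
// Hypothetical helper in the spirit of oclErrorString: turn a cl_int error
// code into its symbolic name. Values match the OpenCL 1.0 cl.h; only a
// selection of the codes is listed here.
const char* clErrorName(int err) {
    switch (err) {
        case 0:   return "CL_SUCCESS";
        case -1:  return "CL_DEVICE_NOT_FOUND";
        case -2:  return "CL_DEVICE_NOT_AVAILABLE";
        case -3:  return "CL_COMPILER_NOT_AVAILABLE";
        case -4:  return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
        case -5:  return "CL_OUT_OF_RESOURCES";
        case -6:  return "CL_OUT_OF_HOST_MEMORY";
        case -11: return "CL_BUILD_PROGRAM_FAILURE";
        case -30: return "CL_INVALID_VALUE";
        case -34: return "CL_INVALID_CONTEXT";
        case -36: return "CL_INVALID_COMMAND_QUEUE";
        case -38: return "CL_INVALID_MEM_OBJECT";
        case -46: return "CL_INVALID_KERNEL_NAME";
        case -48: return "CL_INVALID_KERNEL";
        case -54: return "CL_INVALID_WORK_GROUP_SIZE";
        default:  return "Unknown OpenCL error";
    }
}
```

It is used the same way as the helper in the tutorial, e.g. printf("clBuildProgram: %s\n", clErrorName(err)). A misspelled kernel name passed to clCreateKernel, for instance, would report CL_INVALID_KERNEL_NAME.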

After we set the platform (for now we just select the first platform in the list), we use it to create a context. For this tutorial we choose the GPU to be the device. This could of course become more sophisticated with multiple graphics cards or other devices, and in another tutorial we will discuss checking for device capabilities.

    cl_context_properties properties[] =
        { CL_CONTEXT_PLATFORM, (cl_context_properties)(platforms[0])(), 0};
    context = cl::Context(CL_DEVICE_TYPE_GPU, properties);
    devices = context.getInfo<CL_CONTEXT_DEVICES>();

The last action of the constructor is to create the command queue:

    deviceUsed = 0;
    try{
        queue = cl::CommandQueue(context, devices[deviceUsed], 0, &err);
    }
    catch (const cl::Error& er) {
        printf("ERROR: %s(%d)\n", er.what(), er.err());
    }

Here you see the exceptions provided by cl.hpp in use. Note that it is necessary to define

#define __CL_ENABLE_EXCEPTIONS

before including cl.hpp, as I did at the top of cll.h.

Once the constructor is finished, we want to load our OpenCL program, so in main.cpp we do:

#include "part1.cl"
example.loadProgram(kernel_source);

You may wonder why we have an include here: I find it a nice way to bundle our OpenCL source with our library, using a neat little macro trick.
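The trick relies on the preprocessor's # (stringizing) operator. Here is a minimal sketch of the idea; the STRINGIFY macro name is my assumption, and kernel_source matches the name that appears in loadProgram below, but check part1.cl for the exact definitions the project uses:

```cpp
#include <string>

// The # operator turns the macro argument into a string literal, so the
// kernel can be written as ordinary-looking code yet compiled into the
// executable as a string (no .cl file needs to be found at run time).
#define STRINGIFY(A) #A

std::string kernel_source = STRINGIFY(
__kernel void part1(__global float* a, __global float* b, __global float* c)
{
    unsigned int i = get_global_id(0);
    c[i] = a[i] + b[i];
}
);
```

Stringizing collapses the kernel onto a single line, which is harmless to the OpenCL compiler (though build-log line numbers become less useful). Commas inside parentheses are fine, but a comma at brace level, such as float x, y; in the kernel body, would split the macro argument; a variadic version, #define STRINGIFY(...) #__VA_ARGS__, avoids that.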
loadProgram is defined in cll.cpp:

    size_t pl = kernel_source.size();
    cl::Program::Sources source(1,
        std::make_pair(kernel_source.c_str(), pl));
    program = cl::Program(context, source);

Once the program is created, it must also be compiled. It also helps to have some compiler output in case we have syntax errors (never!).

    try {
        err = program.build(devices);
    }
    catch (const cl::Error& er) {
        // with __CL_ENABLE_EXCEPTIONS, build() throws instead of returning an error code
        printf("program.build: %s\n", oclErrorString(er.err()));
        std::cout << "Build Status: " << program.getBuildInfo<CL_PROGRAM_BUILD_STATUS>(devices[0]) << std::endl;
        std::cout << "Build Options:\t" << program.getBuildInfo<CL_PROGRAM_BUILD_OPTIONS>(devices[0]) << std::endl;
        std::cout << "Build Log:\t " << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0]) << std::endl;
    }

Once the program is loaded and built we are ready to pass the data to our device and do some computing! Let’s take a quick look at our simple kernel in part1.cl so we have an idea of what we are trying to do:

__kernel void part1(__global float* a, __global float* b, __global float* c)
{
    unsigned int i = get_global_id(0);
    c[i] = a[i] + b[i];
}

Since I’m still forming my understanding of OpenCL, I’ll avoid explaining the details (you and me both gotta read the spec! Or check out some of NVIDIA’s nice webinars). This serves as a practical starting point for you to write your own kernels, so let’s see what I’m doing here.
I define my kernel as part1, and it takes three parameters. The __global keyword says what kind of device memory our input is stored in (global memory); besides that, we should already be familiar with passing arrays to functions as pointers.

One way to think of the kernel is as a replacement for a for loop (you can do much cooler things, but we gotta start somewhere). We get the loop index from the built-in get_global_id function, and then we do our operation on the arrays at that index. OpenCL splits the work into work-items (one per index here) and tries to run as many of them as possible in parallel.
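To make the for-loop analogy concrete, here is the same computation as a serial C++ function (part1_serial is my own name for it). The loop index i plays the role of get_global_id(0); OpenCL instead launches one work-item per value of i, which only works because no iteration depends on another.

```cpp
#include <cstddef>
#include <vector>

// Serial equivalent of the part1 kernel. Assumes a and b have the same size.
void part1_serial(const std::vector<float>& a, const std::vector<float>& b,
                  std::vector<float>& c) {
    c.resize(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {  // i <-> get_global_id(0)
        c[i] = a[i] + b[i];
    }
}
```

A function like this is also handy as a CPU reference for checking the c_done values read back from the device.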

So how do we give OpenCL our arrays and tell it to do the work? That’s all in part1.cpp.
We set up our kernel in the popCorn function (hehe).
First we create the kernel object (with error checking):

    try{
        kernel = cl::Kernel(program, "part1", &err);
    }
    catch (const cl::Error& er) {
        printf("ERROR: %s(%d)\n", er.what(), er.err());
    }

Notice that the string we pass in is the name of the kernel as we defined it in the .cl file.
Then we set up the arrays we want to work on; I created a trivial example, as you can see in the file. The important part is creating the OpenCL array buffers and pushing the data to the device:

    size_t array_size = sizeof(float) * num;
    //our input arrays
    cl_a = cl::Buffer(context, CL_MEM_READ_ONLY, array_size, NULL, &err);
    cl_b = cl::Buffer(context, CL_MEM_READ_ONLY, array_size, NULL, &err);
    //our output array
    cl_c = cl::Buffer(context, CL_MEM_WRITE_ONLY, array_size, NULL, &err);

Note that our inputs are created as read-only buffers and our output as write-only; there is also CL_MEM_READ_WRITE. These flags tell OpenCL how kernels will access each buffer (writing to a CL_MEM_READ_ONLY buffer from a kernel is undefined), and the implementation can use them to optimize for performance.
Pushing the data is the same for each input array:

err = queue.enqueueWriteBuffer(cl_a, CL_TRUE, 0, array_size, a, NULL, &event);

Then we tell the kernel which buffers correspond to which arguments:

    err = kernel.setArg(0, cl_a);
    err = kernel.setArg(1, cl_b);
    err = kernel.setArg(2, cl_c);

You can also pass in other types of arguments, but we should see that in the next installment.
Next we call

queue.finish();

which makes sure that all commands in the queue are done executing before the program continues. This will be more important in later tutorials where we run our kernel in a loop as fast as possible.
For now we just run it once in the runKernel() function:

err = queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(num), cl::NullRange, NULL, &event);

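cl::NDRange(num) is the global work size: one work-item per array element, one-dimensional like our arrays. Passing cl::NullRange as the local size lets the OpenCL implementation choose the work-group size.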
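To see how the global size relates to work-groups, here is a small CPU simulation of a one-dimensional NDRange (my own illustration, with a hypothetical simulateNDRange function). It relies on the OpenCL rule that, with a zero offset, get_global_id(0) equals get_group_id(0) * get_local_size(0) + get_local_id(0), and on the OpenCL 1.0/1.1 requirement that an explicit local size divide the global size evenly.

```cpp
#include <cstddef>
#include <vector>

// Walk every work-group and every work-item inside it, the way the device
// would, and count how many times each global id gets visited.
std::vector<int> simulateNDRange(std::size_t globalSize, std::size_t localSize) {
    std::vector<int> visits(globalSize, 0);
    const std::size_t numGroups = globalSize / localSize;  // exact in OpenCL 1.x
    for (std::size_t group = 0; group < numGroups; ++group) {
        for (std::size_t local = 0; local < localSize; ++local) {
            const std::size_t global = group * localSize + local;
            ++visits[global];
        }
    }
    return visits;
}
```

Every global id is visited exactly once, which is why each work-item can own one array element. If num were not a multiple of a local size you wanted to pick yourself, the usual workaround is to round the global size up to the next multiple and add an if (i < n) guard in the kernel; passing cl::NullRange as above sidesteps the issue.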
Finally, we read from our device memory to see if the c array got updated the way we expected!

float *c_done = new float[num];  // host memory for the results (delete[] it when done)
err = queue.enqueueReadBuffer(cl_c, CL_TRUE, 0, sizeof(float) * num, c_done, NULL, &event);

Now we can simply print out the contents of the c_done array!

for(int i=0; i < num; i++)
{
    printf("c_done[%d] = %g\n", i, c_done[i]);
}

Hopefully the code and this walk-through give you a good starting point to learn more. Expect more tutorials, with the next one involving the OpenCL Profiler and determining device information, followed by OpenCL/OpenGL context sharing.
I definitely appreciate any feedback!

Posted in code, opencl | Leave a comment

Adventures in OpenCL: Part 1, Getting Started

This tutorial series is aimed at developers trying to learn OpenCL from the bottom up, with a focus on practicality (i.e. I’m still learning, and I’m sharing what I’ve found to work). Learning by example works best for me, so make sure to get the code! It can only help to have a copy of the OpenCL specification handy, and it doesn’t hurt (too badly) to read it!

NOTE: If you are interested in using the C++ Bindings, I recommend you check out my revision of this tutorial.

My code works for me on my MacBook Pro (with GeForce 9400M) running Snow Leopard with the NVIDIA GPU SDK, as well as on the Ubuntu 10.04 workstations (with GTX 480 or GeForce 8800GTX). Unfortunately I haven’t spent any time developing on Windows, so for now my tutorials will be UNIX-centric (I would LOVE any help setting up a Windows environment; I’ll need one eventually for my Blender project). The code should also build against the ATI Stream SDK and run on its runtime. You can assume I’m complying with OpenCL 1.0 (1.1 may be covered in the more advanced topics, and I’ll point it out when it is). Please let me know if you have build problems or device issues! Throughout the tutorial I will use device and GPU interchangeably: OpenCL can already run on CPUs and is targeting many other devices, but for now I’m assuming a GPGPU bias.

Let’s get started!

You’ll need to have installed:

NVIDIA or ATI GPU SDK and OpenCL-enabled drivers
CMake (introduction and in-depth tutorial [pdf])
and it helps to have Git (introduction and nice book)

Download the code. For the rest of the tutorial I will refer to the directory it’s in as advcl.

In the advcl directory you should have the following directories:

part1/       //the source code files for this tutorial
part1.5/     //the source code for the part1.5 (C++ bindings)
cmake/       //CMake scripts that help locate necessary libraries
opencl10/    //OpenCL 1.0 header files (downloaded from Khronos.org)
opencl11/    //OpenCL 1.1 header files (downloaded from Khronos.org)

First we will build the code to make sure it works. I like to do an “out of source” build like so:

cd part1
mkdir build
cd build
cmake ..
make

This generates all of the build files (the Makefile, the library and the executable) in the build directory, which avoids cluttering up your source code directory. You can run the example like so:

./part1.x

The Source Code Files

Let’s first have a broad overview of each of the source files, and then we can dive in and look at what the code is doing. I’ve set the code up as a library with a CL class that can be instantiated and used anywhere. For this tutorial it is deliberately not very general, so that it’s easier to see what’s going on behind the scenes. In future tutorials we will refactor it and make our library more powerful!

main.cpp
This is where we test out our CL class. We instantiate it, give it an OpenCL program to compile, then execute the kernel.

cll.h
The main header file for our CL class definition; it also handles including the OpenCL headers on both Linux and Mac.

cll.cpp
The core implementation of our CL class, including functions for initializing the OpenCL context, loading and building an OpenCL program and cleaning up the GPU memory we used.

part1.cpp
Implementation of the functions that set up and run the OpenCL kernel. This is where we actually see OpenCL in action.

part1.cl
The actual OpenCL code to be executed. Right now it’s a simple kernel that adds two arrays and stores the result in a third.

util.h and util.cpp
Utility functions that make things like reading files or printing out OpenCL error messages easier.

CMakeLists.txt
The configuration and build script used to build the project. It makes the project easier to port, and building our code as a library makes it easier to contribute to other projects.

The Source Code Contents

Let’s follow the execution of the main function to see what order things need to be done in. Of course we need to include our library’s definitions:

#include "cll.h"

This includes the OpenCL headers and defines our CL class. If you look there you will see a few public members of type cl_mem, which will point to arrays on our device. The private cl_* members are handles to key OpenCL objects which we will use in the constructor to set up OpenCL for execution.
The constructor is defined in cll.cpp, where it does a few things: it sets the platform, sets the device to use, and creates the OpenCL context and a command queue. Let’s see how it’s done:

err = oclGetPlatformID(&platform);
printf("oclGetPlatformID: %s\n", oclErrorString(err));

The oclGetPlatformID function is a helper function defined in util.cpp that I lifted from the NVIDIA SDK (I figure they won’t mind for educational purposes, but it’s not under a free license, so be careful!). This function looks for the NVIDIA platform and defaults to the first available one if it’s not found (“Apple” is the only platform on my Mac). The second line is something you will see peppered throughout the code: oclErrorString(err) is another useful NVIDIA helper function that gives you a little more info about what went wrong if you have a problem. You can get more explanation of the different error codes from the Khronos specification page for the function that failed, for example clGetDeviceIDs (a quick Google search for the function name generally returns the Khronos page first).

After we set the platform, we use it to select a device. In this code we get a list of the available devices but we end up just choosing the first in the list. This could of course become more sophisticated with multiple graphics cards, and in another tutorial we will discuss checking for device capabilities.

err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &numDevices);
devices = new cl_device_id [numDevices];
err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, numDevices, devices, NULL);
deviceUsed = 0;

Note that we hardcoded the device type to GPU. You can check the specification for the others, but I’ll give you a hint: one of them starts with CL and ends with CPU ;)
Now we can make an OpenCL context using our device (the more advanced options like sharing an OpenGL context will come in a following tutorial):

context = clCreateContext(0, 1, &devices[deviceUsed], NULL, NULL, &err);

The last action of the constructor is to create the command queue:

command_queue = clCreateCommandQueue(context, devices[deviceUsed], 0, &err);

Once the constructor is finished, we want to load our OpenCL program, so in main.cpp we call

example.loadProgram("part1.cl");

loadProgram is defined in cll.cpp. It simply reads the .cl file into a string and passes the string to

program = clCreateProgramWithSource(context, 1, (const char **) &cSourceCL, &program_length, &err);

Here we load only one string of program_length characters, but you could do more at once.
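For reference, reading a .cl file into a string takes only a few lines of standard C++. This is a sketch under my own naming (readKernelFile is hypothetical, not the helper in util.cpp), which throws if the file cannot be opened:

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Read an entire text file (e.g. part1.cl) into a std::string.
std::string readKernelFile(const std::string& path) {
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    if (!in) {
        throw std::runtime_error("could not open " + path);
    }
    // Copy every character from the stream into the string.
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

The string's c_str() and size() can then supply the source pointer and the program_length argument of clCreateProgramWithSource. Opening in binary mode keeps the character count identical to the file size on every platform.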
Once the program is created, it must also be compiled, so we call the private function buildExecutable to do that. It has some nice error checking that will output the build log if you have syntax errors in your OpenCL code. I won’t detail it here, but check out the bottom of the cll.cpp file.

Once the program is loaded and built we are ready to pass the data to our device and do some computing! Let’s take a quick look at our simple kernel in part1.cl so we have an idea of what we are trying to do:

__kernel void part1(__global float* a, __global float* b, __global float* c)
{
    unsigned int i = get_global_id(0);
    c[i] = a[i] + b[i];
}

Since I’m still forming my understanding of OpenCL, I’ll avoid explaining the details (you and me both gotta read the spec! Or check out some of NVIDIA’s nice webinars). This serves as a practical starting point for you to write your own kernels, so let’s see what I’m doing here.
I define my kernel as part1, and it takes three parameters. The __global keyword says what kind of device memory our input is stored in (global memory); besides that, we should already be familiar with passing arrays to functions as pointers.

One way to think of the kernel is as a replacement for a for loop (you can do much cooler things, but we gotta start somewhere). We get the loop index from the built-in get_global_id function, and then we do our operation on the arrays at that index. OpenCL splits the work into work-items (one per index here) and tries to run as many of them as possible in parallel.

So how do we give OpenCL our arrays and tell it to do the work? That’s all in part1.cpp.
We set up our kernel in the popCorn function (hehe).
First we create the kernel:

kernel = clCreateKernel(program, "part1", &err);

Then we set up the arrays we want to work on; I created a trivial example, as you can see in the file. The important part is creating the OpenCL array buffers and pushing the data to the device:

//our input arrays
cl_a = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(float) * num, NULL, &err);
cl_b = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(float) * num, NULL, &err);
//our output array
cl_c = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(float) * num, NULL, &err);

Note that our inputs are created as read-only buffers and our output as write-only; there is also CL_MEM_READ_WRITE. These flags tell OpenCL how kernels will access each buffer (writing to a CL_MEM_READ_ONLY buffer from a kernel is undefined), and the implementation can use them to optimize for performance.
Pushing the data is the same for each input array:

err = clEnqueueWriteBuffer(command_queue, cl_a, CL_TRUE, 0, sizeof(float) * num, a, 0, NULL, &event);

Then we tell the kernel which buffers correspond to which arguments:

err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &cl_a);
err  = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *) &cl_b);
err  = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *) &cl_c);

You can also pass in other types of arguments, but we should see that in the next example.
Next we call

clFinish(command_queue);

which makes sure that all commands in the queue are done executing before the program continues. This will be more important in later tutorials where we run our kernel in a loop as fast as possible.
For now we just run it once in the runKernel() function:

err = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, workGroupSize, NULL, 0, NULL, &event);

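Despite its name, workGroupSize is passed as the global work size (the total number of work-items), not the work-group size. It is defined to be the same size as our data arrays (one work-item per element), so it is one-dimensional. Passing NULL for the local work size lets the implementation choose the work-group size. This will be much more interesting with more complex kernels.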
Finally, we read from our device memory to see if the c array got updated the way we expected!

float *c_done = new float[num];  // host memory for the results (delete[] it when done)
err = clEnqueueReadBuffer(command_queue, cl_c, CL_TRUE, 0, sizeof(float) * num, c_done, 0, NULL, &event);

Now we can simply print out the contents of the c_done array!

for(int i=0; i < num; i++)
{
    printf("c_done[%d] = %g\n", i, c_done[i]);
}

Hopefully the code and this walk-through give you a good starting point to learn more. Expect more tutorials, with the next one involving the OpenCL Profiler and determining device information, followed by OpenCL/OpenGL context sharing.
I definitely appreciate any feedback!

Posted in code, opencl | 6 Comments

20 million particles in OpenCL on the GTX480

I’m trying to push the limits of interactive scientific visualization, so I gotta start somewhere! OK, so 20 million particles running at ~40fps is not a bad start, but this is also using the simplest rendering possible (GL_POINTS of size 1, with no alpha blending or lighting).

I’m running my code on Ubuntu 10.04 on a Dell Precision T7500 with an NVIDIA GTX480 (1.5GB GDDR5). The machine has 12GB of RAM and 8 CPU cores, which this code barely uses (besides initializing the system and running the main loop, of course). Check out the video to see some pretty patterns and a large number of particles!

A quick note for those waiting for these in Blender: currently I have it working, but only with about 100k particles if I use a mesh as the emitter (any more and Blender starts choking before I get into the game engine). Now that I’m getting comfortable with OpenCL/OpenGL, I’m going to start looking at the existing particle interface (as well as the redesign) so I can integrate more tightly with Blender.

Let’s talk some math and look at some numbers, if that doesn’t sound fun it’s ok to stop reading now ;)

So first let me admit that this OpenCL kernel is not doing that much work. I’m solving the Lorenz attractor ODEs with the RK4 method for each particle. It’s all embarrassingly parallel, and there is no interaction between particles. Second, I toned down the rendering so it’s doing as little as possible. This lets us reach the memory limit of the GTX480 at 20 million particles with my setup. Let’s see why:

I use float4 arrays (the OpenCL vector type holding four 32-bit floats) for vertices, colors, generators and velocities. I use one float array to keep track of the life of each particle.

4 (arrays) x 20,000,000 (particles) x 4 (floats) x 4 (bytes) = 1,280,000,000 bytes

1 (array) x 20,000,000 (particles) x 1 (float) x 4 (bytes) = 80,000,000 bytes

So that’s 1,360,000,000 bytes / 1024^3 ≈ 1.267GB of memory on the graphics card!
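The same arithmetic as a small C++ sketch, so the particle count can be varied (the function names are hypothetical). The last function divides the card's memory by the 68 bytes each particle needs; it ignores everything else that lives in video memory, so it is an upper bound rather than a prediction:

```cpp
#include <cstdint>

// Bytes of device memory for the particle arrays described above:
// four float4 arrays (vertices, colors, generators, velocities) plus one
// float array (life), with 4-byte floats: 4 * (4 * 4 + 1) = 68 bytes each.
std::uint64_t particleBytes(std::uint64_t numParticles) {
    const std::uint64_t float4Arrays = 4, floatsPerFloat4 = 4, floatArrays = 1;
    const std::uint64_t bytesPerFloat = 4;
    return numParticles * bytesPerFloat *
           (float4Arrays * floatsPerFloat4 + floatArrays);
}

// Convert bytes to GB using 1 GB = 1024^3 bytes.
double toGB(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
}

// Upper bound on particles whose arrays fit in availableBytes.
std::uint64_t maxParticles(std::uint64_t availableBytes) {
    return availableBytes / particleBytes(1);
}
```

For the GTX480's 1.5GB, maxParticles(1536ULL * 1024 * 1024) gives 23,685,481, so 20 million particles already use roughly 84% of the card.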

Luckily I’m using OpenGL interop, so the vertex and color arrays are actually VBOs in OpenGL’s context and are modified in place. Since that’s the case, we don’t transfer any of that memory back to the CPU, which would be a serious problem. It’s possible my kernel could be optimized more, but right now memory and rendering are the limiting factors. When I start implementing more interesting physics like collision detection and fluid dynamics, this will be a bigger issue. I’m also planning to implement depth sorting using an index array VBO so I can render cool effects with proper alpha blending. This will of course also limit the number of particles possible, and my guess is that rendering (not memory) will be the biggest culprit.
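For the curious, here is what the per-particle work amounts to: a serial C++ sketch of one RK4 step of the Lorenz system. It assumes the classic parameters sigma = 10, rho = 28, beta = 8/3 (the post doesn't say which values the kernel uses) and a plain three-float state rather than the kernel's float4; the names are mine.

```cpp
#include <array>

using State = std::array<float, 3>;  // (x, y, z) position of one particle

// Lorenz system:
//   dx/dt = sigma * (y - x)
//   dy/dt = x * (rho - z) - y
//   dz/dt = x * y - beta * z
// Classic parameter values, assumed for this sketch.
struct Lorenz {
    float sigma = 10.0f;
    float rho = 28.0f;
    float beta = 8.0f / 3.0f;

    State deriv(const State& s) const {
        return State{sigma * (s[1] - s[0]),
                     s[0] * (rho - s[2]) - s[1],
                     s[0] * s[1] - beta * s[2]};
    }
};

// Component-wise s + h * d.
State addScaled(const State& s, const State& d, float h) {
    return State{s[0] + h * d[0], s[1] + h * d[1], s[2] + h * d[2]};
}

// One classical fourth-order Runge-Kutta (RK4) step of size dt.
State rk4Step(const Lorenz& f, const State& s, float dt) {
    const State k1 = f.deriv(s);
    const State k2 = f.deriv(addScaled(s, k1, 0.5f * dt));
    const State k3 = f.deriv(addScaled(s, k2, 0.5f * dt));
    const State k4 = f.deriv(addScaled(s, k3, dt));
    State next;
    for (int i = 0; i < 3; ++i) {
        next[i] = s[i] + dt / 6.0f * (k1[i] + 2.0f * k2[i] + 2.0f * k3[i] + k4[i]);
    }
    return next;
}
```

Because every particle is integrated independently, the OpenCL version can run this step once per work-item, using get_global_id(0) to pick the particle, with no communication between work-items.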

1 million particles

1 million particles with alpha blending and pointsize=10

Posted in blender, code, opencl | 4 Comments

Blender Game Engine: Particles in the Mix

I think I can show you better than I can tell you

Check out the code

This is the first major milestone in my journey! I’ve got a lot of work to do before this is in any usable condition for artists, but nonetheless it feels good to see my OpenCL Particle System and interact with it through the Blender Game Engine. Right now I’ve linked my EnjaParticles library into Blender. I create a system using a custom modifier (called Enja for now) on an object, and if the modifier is present I divert rendering of that object to a custom function.

My next step is to get a better understanding of all the OpenGL magic going on because my colors and alpha blending are not behaving as expected. At the same time I’m thinking I’ll rewrite the particle system code inside the BGE and start looking at how the existing particle system works. I’m going to start with a small subset of the existing functionality, so please let me know if you’re dying to have a certain particle effect in the game engine so I can prioritize better, otherwise I’ll just do what looks the most fun.

I want to spend some more time thinking about handling the OpenCL context, perhaps loading it, building the programs and preparing the kernels when the game engine starts. Right now an entirely new context is created for each system, which seems quite wasteful. I also need to clean up the build process, which moving the code inside Blender should help with. That said, making OpenCL available to the whole of Blender might encourage other developers to try their hand at accelerating other areas.
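One way to get a single context per game engine session is a lazily created shared object that also caches built programs by name. The sketch below only illustrates that design under my own assumptions: StubContext and StubProgram are hypothetical placeholders for cl::Context and cl::Program so the code compiles without OpenCL, and CLManager is not an existing Blender or EnjaParticles class.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical placeholders for the real OpenCL objects; the counter just
// makes it visible how many times a program was "built".
struct StubContext { int programsBuilt = 0; };
struct StubProgram { std::string source; };

// One context for the whole session, created on first use, with programs
// cached by name so every particle system reuses the same build.
class CLManager {
public:
    static CLManager& instance() {
        static CLManager manager;  // constructed once, on first call
        return manager;
    }

    StubContext& context() { return context_; }

    std::shared_ptr<StubProgram> program(const std::string& name,
                                         const std::string& source) {
        auto it = programs_.find(name);
        if (it != programs_.end()) {
            return it->second;  // already built: reuse it
        }
        ++context_.programsBuilt;  // stands in for building the program
        auto prog = std::make_shared<StubProgram>(StubProgram{source});
        programs_[name] = prog;
        return prog;
    }

private:
    CLManager() = default;
    StubContext context_;
    std::map<std::string, std::shared_ptr<StubProgram>> programs_;
};
```

Each particle system would ask CLManager for its program instead of creating a context, so the compile cost is paid once per program rather than once per system. In C++11 and later the function-local static is initialized thread-safely.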

Posted in blender, code, opencl | 6 Comments

Simple Particles with OpenCL and OpenGL

And so I reach another milestone in my journey to put OpenCL Particles in Blender! I’ve written my own (very) simple particle system from scratch in C++ using OpenGL and OpenCL, taking advantage of VBO interoperability. OpenCL and OpenGL interop is all about keeping the processing on the GPU, so the array of vertices you are drawing from and the array of vertices you are moving around are actually pointers to the same spot in GPU memory. This saves lots of time, especially if you are manipulating millions of particles in real time and you don’t want to transfer megabytes of data back and forth each frame!

Check out the code

I based my work on the NVIDIA GPU SDK example oclSimpleGL, but I wanted to make my code a standalone library that could be popped into some other graphics context (namely Blender). I was able to separate my code nicely into relevant files: opencl.cpp for the OpenCL functions, enja.cpp for all the particle stuff, and main.cpp for the window/GLUT and rendering stuff. Instantiating an EnjaParticles class creates a system and populates two VBOs, one for vertices and one for colors, as well as an OpenCL context and all the necessary OpenCL variables. All one has to do is call the update function and the OpenCL kernel is executed, automatically updating the VBOs, which can then be rendered by OpenGL. Right now the constructor for the particle system only takes an array of vertices (which I intend to get from a DerivedMesh object in Blender) and makes the VBO ids available so the rendering context can use them.

The code currently compiles and runs both on Ubuntu with NVIDIA drivers on a GTX480 (yeah baby!) and on my MacBook Pro with an NVIDIA GeForce 9400M (it shouldn’t be hard to port to Windows, but I don’t have time for that now). I use CMake, so check out the README to make sure you configure your environment correctly (you need to set up a couple of environment variables so it can find the right headers to include and libraries to link against). In the CMakeLists.txt file you can also turn GL_INTEROP on and off if you want to see the performance difference. This code also runs in the OpenCL Visual Profiler, which wasn’t the easiest thing to get working. You need to be very careful not only with the OpenCL objects you instantiate, but also with ones that implicitly get created by calling cl functions! Once it works, it’s very nice for seeing how your code is behaving on the GPU.

I’ve noticed that on my MacBook Pro the GL/CL interop is not behaving as expected, and I suspect that memory is still being transferred. Without the OpenCL profiler I will have to do my own timings to get to the bottom of this. Adding to my suspicion is that there is no difference in performance for the oclSimpleGL example on my MBP.
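Doing my own timings can start as simply as this hedged sketch (averageMs is a hypothetical helper, not part of the project). One caveat matters for GPU code: OpenCL enqueue calls can return before the device has finished, so the work being timed has to end with something blocking, such as clFinish on the command queue, or the numbers will only measure how fast commands are queued.

```cpp
#include <chrono>
#include <functional>

// Average wall-clock time of `work` in milliseconds over `iterations` runs.
// For GPU work, `work` itself must block until the device is done.
double averageMs(const std::function<void()>& work, int iterations) {
    using Clock = std::chrono::steady_clock;  // monotonic, good for intervals
    const Clock::time_point start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        work();
    }
    const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
    return elapsed.count() / iterations;
}
```

For example, timing a lambda that calls the particle system's update function followed by clFinish, once with GL_INTEROP on and once with it off, should show whether interop is actually saving a transfer on a given machine.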

It took me over a week to get this working like I expected; I don’t intend for it to take that long to get it into Blender. I really want to start working on interesting stuff like rigid body collision and Python interaction, let alone more complex physics!


1,000,000 particles strong!

Posted in blender, code, opencl | 9 Comments

Particles on my Android

Since I just got my first Android phone (an HTC Eris, while I wait for my HTC Incredible to be shipped) and I read about the Android NDK as well as its OpenGL ES support, it seemed like a good idea to write a simple particle system in C as practice for my Blender OpenCL Particle project. For some reason I also find it plain cool to write C on a phone with such a nice SDK and support system.

I’ve uploaded my code in the hope that others trying to get into C development with OpenGL may benefit from this example. I based the Java part on the san-angeles demo in the NDK and added more touch interaction: dragging rotates the scene and touching moves the particle emitter. The system is pretty simple right now, and I still need to get more familiar with OpenGL. I’m looking forward to trying OpenGL ES 2.0, but I need to wait for my Incredible since the Eris doesn’t support it.

This was a fun exercise and it feels good to get some fairly low-level control over my phone. I need to focus more on OpenCL acceleration of particle systems on bigger GPUs, but luckily almost all of the C code could be copy-pasted into a GLUT project and run on a PC. There are some interesting differences to consider, like the lack of hardware floating point support on most Android phones. As a hobby I’m going to explore more interesting effects on the phone, and for work I’m going to add timing/profiling and keep the code modular so I can measure performance on many different platforms (now including phones!).
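On CPUs without a hardware FPU, a common workaround is fixed-point arithmetic: store numbers as scaled integers and do all the math with integer instructions. The sketch below uses the 16.16 format, which is also what OpenGL ES 1.x accepts as GL_FIXED vertex data. The type and function names are hypothetical and this is not the author's code, just one way the particle update could avoid floats.

```c
#include <assert.h>
#include <stdint.h>

/* 16.16 fixed point: upper 16 bits integer part, lower 16 bits fraction. */
typedef int32_t fixed;

#define FIXED_SHIFT 16
#define FIXED_ONE   (1 << FIXED_SHIFT)

/* Valid for |x| < 32768, the range of the integer part. */
static inline fixed fixed_from_int(int x)     { return (fixed)(x * FIXED_ONE); }
static inline fixed fixed_from_float(float x) { return (fixed)(x * FIXED_ONE); }
static inline float fixed_to_float(fixed x)   { return (float)x / FIXED_ONE; }

/* Widen to 64 bits so the intermediate product cannot overflow;
   dividing (instead of shifting) keeps the result well defined for negatives. */
static inline fixed fixed_mul(fixed a, fixed b)
{
    assert(FIXED_ONE == 65536);
    return (fixed)(((int64_t)a * b) / FIXED_ONE);
}

/* One Euler step along a single axis, entirely in integer math. */
void fixed_euler_step(fixed *pos, fixed *vel, fixed accel, fixed dt)
{
    *vel += fixed_mul(accel, dt);
    *pos += fixed_mul(*vel, dt);
}
```

The trade-off is range: 16.16 only covers roughly ±32768 with a resolution of 1/65536, so positions and velocities have to be kept in sensible units.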

Hmm… what if we ported the Blender Game Engine to Android with the NDK? I hear they have Bullet Physics already… ok I’m getting ahead of myself!

Posted in android, code | 2 Comments

Blender: Creating a Custom Modifier

On my quest to create an OpenCL-enabled Particle System I realized that I need a way for the Blender Game Engine to know when to render an object as particles as opposed to a regular mesh. A straightforward way to achieve this (with other benefits) would be to create a Particle modifier that can be added to an object and that the game engine can then check for. I’d like to thank Moguri for this idea!

So I set out today to make a custom modifier that does nothing but printf its existence! I figured this out by copying the SoftBody modifier and scouring the source code for all references to it. I called my modifier Enja, so just replace any instance of Enja in this post with your own chosen name.

First we want to create our main modifier code file (all paths are relative to the source folder):

blender/modifiers/intern/MOD_enja.c

cp blender/modifiers/intern/MOD_softbody.c blender/modifiers/intern/MOD_enja.c

In here you will add functionality and, of course, replace all instances of SoftBody with Enja. You can see my modified MOD_enja.c.

The rest of the instructions are just modifications to existing files (if you are confused about where things go, just look at Softbody!).

At line 184 of blender/modifiers/intern/MOD_util.c:

	INIT_TYPE(Enja);

At line 70 of blender/modifiers/MOD_modifiertypes.h:

extern ModifierTypeInfo modifierType_Enja;

At line 564 of blender/makesrna/RNA_access.h:

extern StructRNA RNA_EnjaModifier;

At line 303 of blender/blenkernel/BKE_modifier.h:

int modifiers_isEnjaEnabled(struct Object *ob);

We need to edit blender/makesdna/DNA_modifier_types.h in several places:

On line 69, just after eModifierType_Screw and before NUM_MODIFIER_TYPES, add:

eModifierType_Enja,

On line 441, after the SoftbodyModifierData struct, add:

typedef struct EnjaModifierData {
    ModifierData modifier;
    int system;
} EnjaModifierData;

On line 85 of blender/makesrna/intern/rna_modifier.c (before the {0,NULL,0,NULL,NULL} entry), add:

{eModifierType_Enja, "ENJA", ICON_MOD_SOFT, "Enja", ""},

At line 2210 of the same file, add:

static void rna_def_modifier_enja(BlenderRNA *brna)
{

    StructRNA *srna;
    PropertyRNA *prop;

    srna= RNA_def_struct(brna, "EnjaModifier", "Modifier");
    RNA_def_struct_ui_text(srna, "Enja Modifier", "Add a particle system");
    RNA_def_struct_sdna(srna, "EnjaModifierData");
    RNA_def_struct_ui_icon(srna, ICON_MOD_SOFT);

    prop= RNA_def_property(srna, "system", PROP_INT, PROP_NONE);
    RNA_def_property_int_sdna(prop, NULL, "system");
    // we should use an enum but this is hacked together for now
    // this range is to allow the user to select a different system
    RNA_def_property_ui_range(prop, 0, 2, 1, 0);
    RNA_def_property_ui_text(prop, "Systems", "Available particle systems");
    RNA_def_property_update(prop, 0, "rna_Modifier_update");

}
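The comment in rna_def_modifier_enja admits the system selector should really be an enum. Until then, the C side can at least give the raw int names and guard against values with no system behind them: the UI range above allows 0 to 2, while the panel label only names 0: lorentz and 1: gravity. This is a hedged sketch with hypothetical names (EnjaSystem, enja_system_from_int, enja_system_name), and falling back to the first system for unknown values is my own choice. On the RNA side, Blender supports enum properties built from an EnumPropertyItem table, the same kind of table the {eModifierType_Enja, "ENJA", ...} entry is added to.

```c
#include <assert.h>

/* Hypothetical names for the systems selectable through EnjaModifierData.system.
   The order matches the UI label: 0 = lorentz, 1 = gravity. */
typedef enum {
    ENJA_SYSTEM_LORENTZ = 0,
    ENJA_SYSTEM_GRAVITY = 1,
    ENJA_SYSTEM_COUNT          /* number of known systems, not a system */
} EnjaSystem;

static const char *const enja_system_names[ENJA_SYSTEM_COUNT] = {
    "lorentz",
    "gravity",
};

/* Convert the raw int stored in DNA, falling back to the first system
   when the value does not name a known one. */
EnjaSystem enja_system_from_int(int raw)
{
    if (raw < 0 || raw >= (int)ENJA_SYSTEM_COUNT)
        return ENJA_SYSTEM_LORENTZ;
    return (EnjaSystem)raw;
}

/* Only call this with a value that went through enja_system_from_int. */
const char *enja_system_name(EnjaSystem s)
{
    assert((int)s >= 0 && (int)s < (int)ENJA_SYSTEM_COUNT);
    return enja_system_names[s];
}
```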

At line 2337, where the other modifiers are registered, add:

rna_def_modifier_enja(brna);

To add the modifier to our UI we need to edit one more file, ../release/scripts/ui/properties_data_modifier.py. Around line 624, add:

def ENJA(self, layout, ob, md, wide_ui):
    layout.label(text="System:")
    layout.prop(md, "system")
    layout.label(text="0: lorentz 1: gravity")

This should build and give you a modifier under the Simulate header that does nothing but print out a message when you add it. My next step is to check for this modifier in the Game Engine and, of course, start adding some functionality! I wish I could explain more about what each of these lines does, but I’m still trying to understand the RNA/DNA system and the Blender way of doing things.

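Checking for the modifier from the Game Engine comes down to the modifiers_isEnjaEnabled function declared in BKE_modifier.h above. Below is a self-contained sketch of what its body might look like, modeled on the softbody check: walk the object's modifier stack, find the first modifier of type eModifierType_Enja, and report whether it is enabled for realtime display or rendering. The ModifierData and Object structs, the enum values and find_modifier_by_type are simplified stand-ins so the example compiles on its own; inside Blender you would use the real DNA types and its lookup helper modifiers_findByType instead.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for Blender's DNA types; the real ones have many more
   members, and the enum values here are arbitrary. */
typedef enum { eModifierType_Softbody = 1, eModifierType_Enja = 2 } ModifierType;
enum { eModifierMode_Realtime = 1 << 0, eModifierMode_Render = 1 << 1 };

typedef struct ModifierData {
    struct ModifierData *next;
    int type;
    int mode;
} ModifierData;

typedef struct Object {
    ModifierData *modifiers;   /* in Blender this is a ListBase (first/last) */
} Object;

/* Walk the modifier stack and return the first modifier of the given type. */
static ModifierData *find_modifier_by_type(Object *ob, ModifierType type)
{
    for (ModifierData *md = ob->modifiers; md != NULL; md = md->next)
        if (md->type == (int)type)
            return md;
    return NULL;
}

/* The modifier must exist and be enabled for realtime display or rendering. */
int modifiers_isEnjaEnabled(Object *ob)
{
    assert(ob != NULL);
    ModifierData *md = find_modifier_by_type(ob, eModifierType_Enja);
    return md != NULL && (md->mode & (eModifierMode_Realtime | eModifierMode_Render)) != 0;
}
```

The game engine side would then call modifiers_isEnjaEnabled on each object and pick the particle renderer instead of the regular mesh path when it returns true.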
Posted in blender | Leave a comment

Blender and OpenCL: The Journey Begins

And so I embark upon an epic journey: just like that I leave my normal life behind and dive into the depths of Blender’s source code, armed only with courage inspired by the powers and promises of OpenCL. The treasure I seek is an interactive particle system in the Blender Game Engine, accelerated by OpenCL and capable of visualizing millions of particles. I am not so greedy as to desire collisions and sophisticated particle physics on my first quest; interactive visualization of scientific simulations will be a worthy reward for a summer’s work.

One does not embark on such a voyage without any direction, so I have set forth a preliminary map to guide my efforts as I learn more and triumph over unforeseen perils. The Blender Game Engine (BGE) is made up of several modules that interact to provide an interactive 3D experience. The 3D objects are defined in the Blender Kernel; typically they are mesh based and efficiently store the coordinates of their vertices in arrays. The 3D objects are rendered to the screen by the OpenGL Rasterizer, which acts upon these vertex arrays. Interaction with and manipulation of the objects is handled by the Game Logic Module, the meat of which is found in the different Actuators provided by the BGE. The particle system I envision is essentially a bunch of simple points that move around the screen based on some defined behavior patterns. This means we want a custom Blender object where the mesh vertices serve as the locations of the particles. We want custom rendering that draws a little ball (or uses a trick called billboarding for more speed) at each vertex to represent the particle. We can then create a custom actuator which defines the behavior of the particles by changing their location, color and size based on user-defined functions (or later, physics).
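The plan above separates what a particle is (a vertex, plus a color and a size) from how it moves (a user-defined behavior applied by an actuator). A minimal C sketch of that separation uses a function pointer; ParticleState, ParticleBehavior, apply_behavior and drift_and_fade are hypothetical names for illustration, not BGE API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-particle data: position (a mesh vertex), color and size. */
typedef struct {
    float pos[3];
    float color[4];   /* RGBA */
    float size;
} ParticleState;

/* A behavior is any function that updates one particle for elapsed time dt. */
typedef void (*ParticleBehavior)(ParticleState *p, float dt, void *userdata);

/* What a custom actuator might do each logic tick: apply one behavior to all. */
void apply_behavior(ParticleState *ps, size_t n, float dt,
                    ParticleBehavior behavior, void *userdata)
{
    assert(behavior != NULL);
    for (size_t i = 0; i < n; i++)
        behavior(&ps[i], dt, userdata);
}

/* Example behavior: move with a constant velocity and fade out over 2 seconds. */
void drift_and_fade(ParticleState *p, float dt, void *userdata)
{
    const float *vel = (const float *)userdata;   /* expects 3 floats */
    for (int k = 0; k < 3; k++)
        p->pos[k] += vel[k] * dt;
    p->color[3] -= 0.5f * dt;
    if (p->color[3] < 0.0f)
        p->color[3] = 0.0f;
}
```

Swapping in a physics-based behavior later only means passing a different function, which is the appeal of keeping the actuator generic.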

Each journey has a first step, and mine starts with the rendering aspect. I was able to find the OpenGL code where (VertexArray) meshes are drawn to the screen, and inserted my own code to draw simple gluSpheres at each vertex.

The next step in the rendering aspect will be to figure out where each object in the scene is added to the list of items to be drawn, and to call a custom rendering routine for only the cube. Once I have a better understanding of this process I can make the particle rendering more general, so that any object of a certain type (my future particle type), or any object that has my future particle actuator, is rendered this way.

I’ll be setting up a git repo for what I’m doing, but for now one could add the following snippet around line 170 of

source/gameengine/Rasterizer/RAS_OpenGLRasterizer/RAS_VAOpenGLRasterizer.cpp

//try to draw a sphere at each vertex
glColor4f(0.5f, 0.0f, 0.0f, 1.0f);
//create one quadric and reuse it (gluNewQuadric allocates a new object each call)
GLUquadric *quad = gluNewQuadric();
//iterate over vertex arrays
int iti = 0;
for(ms.begin(it); !ms.end(it); ms.next(it)) {
  //normal meshes have one vertex array so this loop only happens once
  printf("iti: %d\n", iti);
  iti++;
  if(it.totindex == 0)
    continue;

  RAS_TexVert *vertex;
  size_t i;

  //iterate over the vertices in the vertex arrays
  for(i=0; i<it.totindex; i++)
  {
    vertex = &it.vertex[it.index[i]];
    const float* v = vertex->getXYZ();
    //save the matrix, then move to the vertex (relative to the object's center)
    glPushMatrix();
    glTranslatef(v[0], v[1], v[2]);
    //gluSphere issues its own glBegin/glEnd, so it must not be wrapped in another glBegin
    gluSphere(quad, .2, 10, 10);
    //restore the matrix for the next vertex
    glPopMatrix();
  }
}
gluDeleteQuadric(quad);
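Billboarding, mentioned above as the faster option, replaces each sphere with a single quad that always faces the camera. One common way to build it is to read the camera's right and up axes out of the modelview matrix and offset each particle position along them. This C sketch assumes an OpenGL column-major matrix (as returned by glGetFloatv(GL_MODELVIEW_MATRIX, ...)) with no scaling; billboard_corners is a hypothetical helper, not part of the BGE.

```c
#include <assert.h>

/* Compute the 4 corners of a camera-facing quad centered at `center`.
   In a column-major modelview matrix without scaling, the first two rows of
   the rotation part are the camera's right and up axes in object space. */
void billboard_corners(const float mv[16], const float center[3],
                       float half_size, float out[4][3])
{
    const float right[3] = { mv[0], mv[4], mv[8] };
    const float up[3]    = { mv[1], mv[5], mv[9] };
    assert(half_size > 0.0f);
    for (int k = 0; k < 3; k++) {
        float r = right[k] * half_size;
        float u = up[k] * half_size;
        out[0][k] = center[k] - r - u;  /* bottom-left  */
        out[1][k] = center[k] + r - u;  /* bottom-right */
        out[2][k] = center[k] + r + u;  /* top-right    */
        out[3][k] = center[k] - r + u;  /* top-left     */
    }
}
```

The four corners can then be passed to glVertex3fv for every particle inside a single glBegin(GL_QUADS)/glEnd() pair, instead of one gluSphere call per vertex.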

I admit that I am no expert in OpenGL (or OpenCL), and all my Blender experience comes from using the Python API, so I am learning at least two new technologies and one complex software package, hence my description of this as a journey. I look forward to getting involved in the Blender development community and learning a lot along the way!

Posted in blender, opencl | 11 Comments

Compiling Blender 2.5

This summer I will be working on improving Blender with OpenCL, specifically integrating and improving particles inside the Game Engine. The first step, of course, is getting the source code and compiling it. I’m building on two systems: my MacBook with Snow Leopard (10.6) in 32-bit (Intel Core 2 Duo and Intel integrated graphics), and an Ubuntu 9.10 workstation with four 64-bit CPUs, 4GB of memory and a GeForce 8800 GTX.

I followed the build instructions on Blender’s website, but of course nothing is ever that straightforward.

Ubuntu

Let’s start with the Ubuntu build:

I followed these instructions to build Blender 2.5 and these to build Blender 2.49b, as well as building Python 3.1 from source.

I copied blender/config/linux2-config.py to blender/user-config.py (for scons). I modified a few lines:

BF_PYTHON='/opt/py31'

WITH_BF_STATICPYTHON=True

This means I statically linked Python 3.1 (I couldn’t get it to work with shared libs; scons complained about undefined symbols: http://pastebin.com/iKGMDpk2)

Note: I’ve since installed python3-all-dev and now dynamic linking of python3 works, and I avoided having to build python3.1 myself.

To run it, just do:

blender-svn/install/linux2/blender

A note about the necessary dev packages: make sure you have them! If you are getting “file not found” errors while building, it’s probably because you are missing a dev package you need. These are the ones I used:

sudo apt-get install subversion build-essential gettext \
libxi-dev libsndfile1-dev \
libpng12-dev libfftw3-dev \
libopenexr-dev libopenjpeg-dev \
libopenal-dev libalut-dev libvorbis-dev \
libglu1-mesa-dev libsdl-dev libfreetype6-dev \
libtiff4-dev libsamplerate0-dev libavdevice-dev \
libavformat-dev libavutil-dev libavcodec-dev libjack-dev \
libswscale-dev libx264-dev libmp3lame-dev \
libsdl1.2-dev \
python3-all-dev

Mac

For the Mac build I used the CMake instructions (scons gave me problems):

I have Xcode and the developer tools for 10.6 installed (default options). I also followed these instructions and installed ffmpeg through MacPorts.

I configured CMake with the GUI. Here is a screenshot of most of the options.

Note: I configured it WITHOUT Blender Player or QuickTime; having either of these enabled caused problems. I also built for i386 (32-bit), as I had problems with 64-bit when using scons.

Things aren’t perfect, and 2.5 is only in alpha so we don’t expect it to be! This is just the first baby step getting into Blender’s guts, I hope to have a better understanding of the whole process soon!

Posted in misc | Leave a comment