LBL Summer Research: Report 1

Hello world! I’m excited to be writing to you from the beautiful Bay area in California, specifically the Lawrence Berkeley National Laboratory at the top of the hill in Berkeley.

view of the Bay from Berkeley Lab

This summer I’ll be working with Hank Childs and the visualization team of the Computational Research Division on, excuse my nerdiness, some pretty cool stuff. The underlying goal of my project is to study the use of GPUs for accelerating distributed computations on super computers. The specific topic of my research will be particle advection for FTLE calculations. In English you could think of this topic as trying to find out how fast things separate in a flow, you can imagine dropping fish food in a turbulent stream and seeing where the flakes get separated the fastest. Like my Master’s research I will be using OpenCL and moving around a lot of particles, only for this project I will be moving a lot more particles on much bigger computers. Luckily, many of the efforts I make here will improve both my understanding and some of my underlying code for my Blender project. Already I have reused several pieces from the RTPS library and I plan on releasing my OpenCL wrappers as an improved standalone library very soon (just need to clean up some and add a couple little examples).

Here is an unscientific but interesting picture of my first successful result. After that I will be describing my project and progress in painful (it hurts so good) computational science detail.

first result of FTLE computations in OpenCL

My research problem for the summer is investigating particle advection on the GPU in a distributed memory setting.

The key issues to explore are the costs and benefits of using a GPU in this environment. Since GPUs can greatly accelerate computation but suffer from increased memory latency we want to find out exactly how using the GPU changes the equations governing communication and performance using an interesting visualization problem as the guinea pig. The FTLE computations themselves are already complete and were handed to me by some smart coworkers in my first week. The interesting thing about FTLE is that it requires seed particles to be advected by a velocity field, a common problem when visualizing data from Computational Fluid Dynamics simulations.

My approach to the project is to first implement some code which can accomplish this in three phases and then study the performance in comparison with an existing CPU implementation by a coworker. The first phase is relatively complete, as evidenced by the picture above, which is to implement the FTLE calculations in OpenCL on one GPU. The second phase I am beginning this week is to setup a communications infrastructure using some routines from Visit (which are based on MPI) for transferring particles between nodes and sending them to and from the GPU. The third phase will be combining the two so that calculations are performed and the results are communicated as necessary.

I’m not sure how long it will take me to do phase 2, as I have relatively little experience with MPI type coding. My distributed data experience is limited to website load balancing and I’m not yet sure if there are more similarities or differences.

I am encouraged by my leaders here to write a weekly report. Some of these, such as this first one will be posted on my blog, others may remain internal until the time is right. In any case I’m excited to share as much as possible about what I’m learning and I look forward to pushing OpenCL to new frontiers!

Posted in code, life, opencl | 3 Comments

Can you see the force?

I’ve been working on my Master’s Thesis lately and getting a headache from all the formal writing. So I just want to blog some informal shit real quick. Here are a bunch of screenshots I took playing with my 2D PyOpenCL implementation of SPH (It’s pretty much the exact same OpenCL code thats in my Blender Game Engine project). I’m playing with OpenGL 3.3 Geometry shaders to make the little red arrows. The arrows represent the force exerted on each particle from the SPH calculations. You can see the waves form as particles are repelled by high density areas and attracted to low density areas. The bigger the force the more red the particles get as well, with no force they are just blue.
I built in the ability to add particles with the mouse by clicking, which you can do while the simulation is paused. So I drew my name and then unpaused it:


Here are some fun looking screens I got by creating some crazy density profiles and letting it go: (click through to imgur to see the fullsize screenshots in 2560×1600 resolution)


So quick shout out to Mike Pan for his geometry shader post that got me started. Using Geometry shaders in PyOpenGL wasn’t the most straightforward thing, but I ended up piecing things together from various sources.
Technical note: For some reason I had to specify

        self.program = glCreateProgram()
        glProgramParameteriARB(self.program, GL_GEOMETRY_INPUT_TYPE_ARB, GL_POINTS)
        glProgramParameteriARB(self.program, GL_GEOMETRY_OUTPUT_TYPE_ARB, GL_POINTS) #not sure why this works
        glProgramParameteriARB(self.program, GL_GEOMETRY_VERTICES_OUT_ARB, 20)
        self.compileProgram( self.program, vshader, fshader, gshader)

Even though I’m doing this in the geometry shader

            #version 330
            layout(points) in;
            layout(triangle_strip) out;

Changing GL_POINTS to anything else gives me cryptic errors. Perhaps some OpenGL veterans wouldn’t mind explaining this? I should probably go read the spec, maybe when I’m done writing!
The code is burried deep in one of the 30 branches in my EnjaParticles github repository, its so experimental and mixed in with other crap that I’m not going to directly link to it. I’m planning to use this as a research platform for improving SPH since prototyping (OpenCL) in python is much easier so eventually it will get cleaned up and released.

Posted in misc | Leave a comment

Particles in BGE: Improved Code, Collisions and Hose

Hello world! I’m back with another video to show you guys what we are cooking up here at DSC. I’ve made a lot of internal improvements to the code, but I have a little something to show as far as features too. Check out the video!

[RTPS] Improved Code, Collisions and Hose from Ian Johnson on Vimeo.

So what have I been up to? Mostly I’ve been cleaning up the code, fixing some nasty bugs in OpenCL that were causing crashes. They seem to all be squashed for now so I can start focusing on some key things (besides writing my thesis!) such as fixing up the collisions even more. I also want to take the hose object I’ve got now and make it more interactive so gamers can start spraying things! I’ve been upgrading the UI and exposing more simulation parameters, but there is plenty more work to be done in that area. I look forward to getting it set up in the Particle Panel when the rendering mode is set to Blender Game.

We’ve also been working on getting the project built on windows, which Andrew Young has gotten well underway. He’s got the standalone library compiling and running on windows (even with stereo 3D rendering!) but we still need to link it up to Blender.

I’m in the process of writing this up for my Master’s thesis, which will lead to some extensive documentation. I hope that with a proper writeup it will be easier to collaborate with other Blender developers and get my code better integrated. I do want to thank the Blender community and especially the devs in the IRC channel for all the help so far, it really is amazing to learn so much from people all over the world!

The DSC model used to contain the fluid particles in this picture was created by Martin Lindelöf. Also thanks to Chris Webber and Slikdigit for Blender Python UI inspiration and tutelage. Shout out to Moguri for all his help!

Posted in blender, code, opencl | 16 Comments

Adventures in OpenCL Part 3: Constant Memory Structs

Today we make a brief stop on our journey to examine a technique which has proven useful but not straightforward; passing in a structure as a parameter to an OpenCL kernel as __constant memory. Why would you want to do this? Allow me to spin you a tale of 8 kernels which share an assortment of around 30 parameters many of which are either configurable by the user or can change at runtime based on user input. Throw in the fact that a complex program like this requires plenty of debugging and who knows which parameters will end up at the finish line and which will be left by the wayside. So rather than specify an assortment of variables to pass to each kernel, we define a structure to hold our colorful collection of floats and ints and just throw the whole bundle at each kernel. This is of course convenient for us programmers, at least once structs in OpenCL make some sense to us, and as a bonus it turns out to be a good idea in terms of memory usage as well. So lets get into it!

Grab the code

You can follow along in C++ or Python

This is just a slight adaption of my Part 1 (C++, Python) tutorial code, so check those out if you need help getting started.

So lets take a look at what we want to achieve in OpenCL:

typedef struct Params
{
    float A;
    float B;
    int C;
} Params;

__kernel void part3(__global const float *a,
                  __global const float *b,
                  __global float *c,
                  __constant struct Params* test
                  )
{
    int gid = get_global_id(0);
    c[gid] = test->A * a[gid] + test->B * b[gid] + test->C;
}

You can see on line 16 that all we are doing here is a simple sum, but multiplying each number by a (float) parameter and adding an (int) parameter. You may already have noticed something weird going on in line 12, why is the Params struct a pointer? Excellent question. Even though we only want one structure, we still need to pass it in as a buffer to appease OpenCL (NOTE: This is a problem we’ve come across, we’d love to see sample code which allowed straightforward passing of a struct). So it’s just going to pass a bunch of binary data to the kernel, and then the kernel will have to structure the data when it gets there. And as far as I can tell the only way to get a buffer into a kernel is by passing it in with the pointer syntax you see on line 12 here.
Passing a struct as a buffer is not much different from passing a regular array in C++:

Params params;
params.A = .5f;
params.B = 10.0f;
params.C = 3;
cl_params = cl::Buffer(context, CL_MEM_READ_ONLY, sizeof(Params), NULL, &err);
err = queue.enqueueWriteBuffer(cl_params, CL_TRUE, 0, sizeof(Params), &params, NULL, &event);

In Python (more on the struct module)

params = struct.pack('ffffi', .5, 10., 0., 0., 3)
params_buf = cl.Buffer(ctx, mf.READ_ONLY, len(params))
cl.enqueue_write_buffer(queue, params_buf, params).wait()

And that’s all it takes to have access to our structure in OpenCL!

Update: Please see David Garcias comment below for a correction of this section. A quick experiment showed that it’s not necessary to have padding in this case. I still find this a confusing issue and it would be nice to see some self-contained example code showing cases where alignment breaks.

So about that float padding[2]; in the struct definition. This is because of memory alignment in OpenCL. The best explanation I’ve seen so far is by AndreasStahl which I will briefly summarize in relation to the struct above.
When interpreting a struct, OpenCL accesses the memory in blocks of 16 bytes, which is the same as 4 floats (each 4 bytes). So in our example if we did not have the padding, we would not be able to access our int because opencl would have interpreted it as the 3rd float out of the first 16bytes. This can get even more complicated if you have an array of structs, because then the size of you’re struct will need to be a multiple of 16, as explained in the linked forum post.

So lets talk a little bit about __constant memory. On a GPU this memory is read-only, cached and usually around 64kb per multiprocessor (you can query this, for an example look to the CLInfo example in the AMD SDK and oclDeviceQuery in the NVIDIA SDK). It is closer to the processor than __global and much faster to access, while slightly slower than __local memory, so if you are passing in a ton of them you may be eating into the precious few (48) kilobytes you have, so moving them to __constant memory can free up a little space as well.

Additionally, at least on some implementations there seems to be an arbitrary limit of 9 constant (non-buffer) parameters, which using a struct will help you avoid. (NOTE: as David points out below, the limit is not arbitrary. It can be queried through clGetDeviceInfo() and CL_DEVICE_MAX_CONSTANT_ARGS)

What about other people’s approaches? What kind of trouble have you run into with structs? kernel parameters?

Posted in advcl, code, python, tutorial | 6 Comments

Adventures in PyOpenCL: Part 2, Particles with PyOpenGL

20,000 ParticlesToday we journey as pythonauts into the world of particle systems and fast 3D graphics, manipulating thousands upon thousands of little dots in mere milliseconds! We accomplish this with two Python modules, PyOpenCL and PyOpenGL. This is a port of my C++ OpenCL tutorial, but with much less code and all the splendors of numpy.

What you’ll need:

Let’s get started!

First lets take a look at the files we will be dissecting

- important files
    main.py
    part2.py
    part2.cl
    initialize.py
- support files
    glutil.py
    vector.py
    timing.py

You can go ahead and run python main.py to see the code in action. The way our particle system works is that we have some collection of particles, really just 3D points in space, each with it’s own lifetime. Every time we loop we want to update the positions of our particles based on some set of rules (e.g. gravity, initial velocity) and decrease their lifetime a little. When the lifetime of a particle reaches 0 we set it back to its original position and reset its lifetime. So we want to structure our code so that we initialize the positions of the particles first, then every frame we update and render them.

main.py is responsible for setting up the OpenGL environment using PyOpenGL and GLUT, providing mouse and keyboard interaction as well as a run loop for our program. So now we have some 3D space we can look around by clicking and dragging (left and right mouse buttons) or hit ‘q’ to quit and ‘t’ to print out timing of the update function.

The Setup

Once we have our environment we need to setup some particles, for this look in initialize.py. The first function you will see is function_np which creates several numpy arrays and sets their values using numpy slice operators. If that’s a little confusing, it may be easier to follow the fountain_loopy function which does the same thing but with a (slower) for loop. Essentially we are randomly placing the particles on a flat donut in the x, y plane. They each start with an initial velocity in the same direction as their position and a random lifetime. We also make an RGBA color array so that each particle can have its own color.

Notice the fountain function at the bottom, which just calls the fountain_np function and creates Vertex Buffer Objects (VBO) out of the numpy arrays. Presumably when you go on to make your own OpenCL/OpenGL code you will start with your own numpy arrays and this is a key point for setting up your memory on the GPU. If you want OpenCL to be able to act on OpenGL memory (without making copies) you will need to prepare VBOs (or render buffer objects with clImage which should be the subject of a later tutorial) before you go to OpenCL. Luckily its dead simple with PyOpenGL

    from OpenGL.arrays import vbo
    pos_vbo = vbo.VBO(data=pos, usage=GL_DYNAMIC_DRAW, target=GL_ARRAY_BUFFER)
    pos_vbo.bind()

Now lets look in main.py to see how the initialize functions are called and the results are passed into OpenCL.

     #set up initial conditions
     (pos_vbo, col_vbo, vel) = initialize.fountain(num)
     #create our OpenCL instance
     self.cle = part2.Part2(num, dt)
     self.cle.loadData(pos_vbo, col_vbo, vel)

Here num and dt are defined at the top of main.py, which are user parameters for how many particles to make and essentially how much simulation time passes each frame. Now we get to the OpenCL, we’ve made an instance of a Part2 class. Then we call loadData with the vbos and velocity array.

Interfacing with OpenCL

Let’s take a look at what the constructor and loadData are doing in part2.py

     self.clinit()
     self.loadProgram("part2.cl");
     self.num = num
     self.dt = numpy.float32(dt)

Notice how we explicitly “cast” dt to a numpy float, this is required to pass variables (non-buffers) to a PyOpenCL kernel.

Now let’s take a minute to look at what clinit is doing, for it is responsible for setting up our CL Context using the existing GL context.

    plats = cl.get_platforms()
    from pyopencl.tools import get_gl_sharing_context_properties
    import sys
    if sys.platform == "darwin":
        self.ctx = cl.Context(properties=get_gl_sharing_context_properties(),
                             devices=[])
    else:
        self.ctx = cl.Context(properties=[
            (cl.context_properties.PLATFORM, plats[0])]
            + get_gl_sharing_context_properties(), devices=None)

    self.queue = cl.CommandQueue(self.ctx)

First we get a list of available OpenCL platforms on the machine and then import a handy function provided by PyOpenCL which abstracts the messy business[pdf] of setting up the right properties for sharing a GL context. Due to the eccentric whims of Apple (hey, they pretty much came up with OpenCL) the way you create the context on Mac OS X is slightly different, hence the if statement checking for “darwin”. After that its back to business as usual by creating a CommandQueue from the context. If you are using PyOpenCL .92 or beta2011 you will need to replace the contents of the clinit function with this code.

So loadProgram is the same as in Part 1 where we simply read in the file to instantiate and build a program object. So lets skip that and get straight into loading the data:

def loadData(self, pos_vbo, col_vbo, vel):
        import pyopencl as cl
        mf = cl.mem_flags

        #... cut out the saving of variables to self ...

        #Setup vertex buffer objects and share them with OpenCL as GLBuffers
        self.pos_vbo.bind()
        self.pos_cl = cl.GLBuffer(self.ctx, mf.READ_WRITE, int(self.pos_vbo.buffers[0]))
        self.col_vbo.bind()
        self.col_cl = cl.GLBuffer(self.ctx, mf.READ_WRITE, int(self.col_vbo.buffers[0]))

        #pure OpenCL arrays
        self.vel_cl = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=vel)
        self.pos_gen_cl = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.pos)
        self.vel_gen_cl = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.vel)
        self.queue.finish()

        # set up the list of GL objects to share with opencl
        self.gl_objects = [self.pos_cl, self.col_cl]

The key here is on line 33 in creating OpenCL buffers from the vbo objects using the cl.GLBuffer class. Notice how we do not pass in the vbo object itself, but the actual integer value of the buffer as it is represented by OpenGL. We create normal OpenCL buffers as we did in Part 1 simply by passing in numpy arrays (♥). The other significant change is defining the gl_objects list which we will use in execute!

Execute!

def execute(self, sub_intervals):
        cl.enqueue_acquire_gl_objects(self.queue, self.gl_objects)

        global_size = (self.num,)
        local_size = None

        kernelargs = (self.pos_cl,
                      self.col_cl,
                      self.vel_cl,
                      self.pos_gen_cl,
                      self.vel_gen_cl,
                      self.dt)

        for i in xrange(0, sub_intervals):
            self.program.part2(self.queue, global_size, local_size, *(kernelargs))

        cl.enqueue_release_gl_objects(self.queue, self.gl_objects)
        self.queue.finish()

The first thing to point out is that we are acquiring the gl_objects before we pass them in as arguments to the kernel. This makes sure that OpenGL is not using the buffers for anything and allows us to read from and write to them. We are setting our global workgroup size to the length of our arrays, essentially saying that each thread global workitem will be one element of our original array.
We also introduce sub intervals, allowing us to perform a variable number of updates per frame. This means one could make dt smaller, generally making the simulation more accurate but not slow down the desired motion of the particles. For example if you decrease dt from .01 to .001 you will be making the particles move 10 times slower every iteration, so to keep the visual speed the same you would do 10 sub iterations.

The Kernel

So after all that setup we are ready to run our kernel, found in part2.cl.

This is also the exact same kernel as used in my C++ version.

So there you have it, the essentials of interoperating between OpenCL and OpenGL in Python! I didn’t cover what I did in my utility files but those are subjects of later posts. You should be able to poke around to see whats going on, and as far as rendering I’m using very basic GL calls for VBOs which there are other tutorials for.

I’d like to shout out to Keith Brafford for helping test and refactor this code on Windows as well as the PyOpenCL patch I worked on to get GL interop working on the Mac. Of course this tutorial wouldn’t be possible without the valiant efforts of Andreas Klöckner!

As a sample of what’s possible I implemented a solver for the 1D Wave equations as described here.

Posted in advcl, code, opencl, python, tutorial | 17 Comments

pycon 2011

sweet shirt from Google

Andrew, Nathan and I just got back from Atlanta for Pycon 2011 and it was a blast. It was a great mix of learning, hacking, networking and free swag. I have never been picky about free t-shirts before, but there were so many badass shirts that I actually walked past a booth with well designed shirts because I already use AWS and don’t want to rep any other hosting provider… Google was giving out cool swag, and I actually won a raffle for a Galaxy Tab which I will be compiling my RTPS particles to shortly. We accumulated more than just free stuff, there was much knowledge to be gained from the talks and the Birds of a Feather sessions. There were a few really great talks, namely David Beazley on turning his 1979 SuperBoard into a cloud computing platform.  This guy is the epitome of a hacker and at the same time an excellent presenter, he takes you on a twisted journey using beautiful and powerful tools to operate on ancient technology to create a frankenstein monster with the byproduct of his sick experiment being five packages ported to python3. Not only does he explain some awesome technical feats, he does it in a way which makes you feel like you could too, and he gives just enough detail to entice you and not too much to bore. I found that the best talks were those who really love doing what they do, and the excitement shines from them more than the projector. In particular I really liked Christopher Webber‘s talk on the Blender Python API bpy as well as Yung-Yu Chen’s talk on SOLVCON. I do not want to sound overly critical or negative, but I do think it should be mentioned that we were underwhelmed with more than a few presentations. Granted public speaking is not easy, especially when you speak on inevitably complex topics but the quality of many presenters (but not the quality of their content) left something to be desired. Perhaps at a conference about such an elegant language I would hope people would express themselves with more clarity. In any case, there was a huge amount of knowledge encapsulated in the talks (not all of them seem to be in this archive).

The Bird’s of a Feather were something else entirely, I found them all incredibly informative, met some really smart and knowledgable people and exchanged some cool ideas. The Scientific Computing, Visualization and GPU Computing sessions all had around 20+ people discussing various techniques, problems and libraries they experience in several interesting and disparate fields. A bunch of Blendheads got together Saturday night and did some show-and-tell along with lots of great discussions. I was lucky enough to be schooled on bpy and UI coding by none other than slikdigit and paroneayea and I’m inspired to refactor my current efforts to expose much more of my library to Python. It’s amazing what artists can do with the power of Python in their palette.

One last amazing thing was how many people were hiring, I don’t think there wasn’t an exhibitor who didn’t open the conversation with “how far along are you in your studies,” even the ones selling services to developers. Apparently there is no recession for Python developers.

All in all, I’m glad I’m a pythonaut. I definitely want to thank my advisor Gordon Erlebacher for sending us to represent DSC!

Posted in code, community, python, travel | Leave a comment

A Python Function Timing Decorator

I recently went up to North Carolina to visit a good friend of mine, and we got to talking Python as we are wont to do. He told me about a PEP for enhancing generator syntax and creating cofunctions. To get ready for these we needed to practice using coroutines. I went about this by enhancing my function timing decorator and making it into a class so I could collect and analyze timings on many different functions. Check out the gist, hopefully it will not make you think me a fool, I will discuss my thoughts on it below the code and remove all doubt.

I personally can’t get over how cool decorators are… but I digress. The key part is in the __collector function on line 11 where we collect data anytime a decorated function is called. So our dictionary of function names and timing information is being generated rather than simply populated. This allows us to lump together any special processing we want to do during the collection of a timing, such as adding to the count of timings or keeping a running total.
So this is cool because we can create a coroutine which is like building a generator, but cofunctions are cool because they allow you to essentially “go both ways,” meaning you can have a generator that generates values, but also receives values from another generator. Once you have this two way yielding communication you can do sweet stuff like a scheduler, which excites me because OpenCL is designed with an event driven interface, meaning one could setup a really slick OpenCL accelerated workflow using cofunctions with PyOpenCL.

Posted in code, python | 1 Comment

Hijacking the BGE for Fun and Particles

Hello world! Care to take a journey into the annals of my twisted mind? Join me on a ride through a small part of the Blender Game Engine on a rickety roller coaster of code and commentary. For those interested in modifying the BGE source I hope this experience will give you ideas, and my personal aspiration is that those more knowledgable than myself will suggest ways of improving what I’ve done here. You must be at least proficient with C++ to ride this ride.

The ride I am talking about is the modifications I’ve made to the Blender Game Engine to add a real-time fluid simulation as part of my Master’s research. Right now the code is seperated into two repositories:

  • BGERTPS ([B]lender [G]ame [E]ngine – [R]eal [T]ime [P]article [S]ystem)
  • EnjaParticles (the standalone library containing the OpenCL Fluid Simulator)

We will primarily be focusing on the code in BGERTPS, but you will need both if you want to compile and run the code. If you’re mind has a built in C++ preprocessor, you may want to quit reading and check the list of modified files to see the damage I’ve done for yourself.

So let’s start with a little visual of the interface before we dig into the code. Here you see the modifier panel of our Domain object which controls parameters for the system. Our fluid will never leave this domain, so make sure it is big enough to hold objects you may want to interact with. The size of our particles relevant to our domain is determined by the maximum number of particles you specify. Of course smaller particles means better simulation resolution, but it also means more computation. For now the user will have to experiment with the capabilities of their machine, but in the future we hope to have guidelines and it would even be possible to automatically suggest particle counts based on available hardware. As a note, due to a limitation in the sorting code we are using right now the maximum number of particles must be a power of 2 (1024, 2048, 4096, 8192, 16384… le sigh).

RTPS Modifier UI

Now let’s actually add some fluid to the domain by creating a new object within the domain and giving it a few game properties in the Logic Panel.

Emitter UI

When the game starts the object’s bounding box will be filled with up to num equally spaced particles, if the volume of the bounding box is too small to hold num particles, then it will only inject however many particles are necessary to fill the volume. In the game, after the particles have been emitted, the num property is set to 0. Therefore if one would like to emit more particles, it is possible to use a Property Actuator to set the num property to a number, at which point the game engine will attempt to emit that number of particles and set the property back to 0.

It is also possible to have the particles collide against triangle meshes, so if you want the fluid to interact with an object make sure to switch to Edit mode, select all and hit ctrl-T to turn all the faces to triangles. It is also important to know which way your normals are facing. Once you have those right, it is as simple as adding a collide boolean to your object’s game properties. At the moment physics of the object are not shared with the fluid, meaning that forces do not transfer between them. We also have not optimized these collisions, so too many triangles (greater than a few hundred on most systems) will start slowing things down a bit.

Collision UI

There are also example blend files in EnjaParticles in the blender folder, like rtps_dam_demo.blend that could be used as a starting point. So that’s how you use it, but how do we make it, and are there better ways to do it?

RTPS API

Let’s start with the modifier. I’m going to skip how I made my RTPS modifier because I’ve detailed it line by line before. The important thing to start with is what the RTPS class needs for it to work, so let’s look at the API I use so far. (Everything in the RTPS library is in the rtps namespace)

First is the constructor (in BL_ModifierDeformer.cpp):

rtps::RTPSettings settings(sys, rtmd->num, rtmd->dt, grid, rtmd->collision);
(*slot)->m_pRTPS = new rtps::RTPS(settings);

So here we create an RTPSettings object, as well as an RTPS object. I’ll explain the (*slot) business in a bit but for now let’s just say we are saving a pointer to our RTPS object. So what are the settings?

  • sys: the type of system, for fluids we have rtps::RTPSettings::SPH
  • rtmd->num: the maximum number of particles
  • rtmd->dt: the time step we use to integrate the system
  • grid: the bounding box of our domain
  • rtmd->collision: whether or not to perform collision checks

The settings object is simply passed to the constructor of the RTPS class which creates the particle system we specify. If we want to add any particles to the system (presumably we do) we can call the following function (also in BL_ModifierDeformer.cpp)

rtps->system->addBox(nn, min, max, false);

Here the parameters are as follows:

  • nn: the number of particles to (attempt) to add
  • min: the “bottom left” corner of the bounding box (as a float4)
  • max: the “upper right” corner of the bounding box (also a float4)
  • false: a boolean specifying whether to scale the values for the simulation

Before we go into detail about how we construct and populate our particle systems, there are two quick functions we need to mention, namely:

(*slot)->m_pRTPS->update();

which does all of the simulation work every frame, and then there is (in RAS_ListRasterizer.cpp):

if(ms.m_bRTPS)
    ms.m_pRTPS->render();

and

if(ms.m_bRTPS)
    ms.m_pRTPS->render();

which handles all of the OpenGL rendering of our particles (and subsequently bypasses the normal Blender rendering of the domain). One bug here is that if you are in wireframe mode, the particles won’t render.

Modifier

Now lets go back to creating our system and seeing how we keep track of it in the game. Everything comes down to our RTPS modifier, an object with this modifier defines a system and provides a reference to it. We exploit the fact that modifiers have an Update routine which gets called every frame of game play, and we use that to tell our system what to do. So how does our system get attached to an object? Let’s start with how the modifier gets applied.
This happens in KX_BlenderDataConverter.cpp in the gameobject_from_blenderobject function:

bool bIsRTPS = BL_ModifierDeformer::HasRTPSDeformer(ob);

and

BL_ModifierDeformer *dcont = new BL_ModifierDeformer((BL_DeformableGameObject *)gameobj,
kxscene->GetBlenderScene(), ob,	meshobj, bIsRTPS);

This function is getting called on all the objects in the scene, so we want to make sure our object knows that it is an RTPS by making its ModifierDeformer aware of the fact. The first step is seeing if the current object has our RTPS modifier, with HasRTPSDeformer(obj) which is defined in BL_ModifierDeformer.cpp on line 147. The boolean value is stored as a member of the ModifierDeformer instance as defined in BL_ModifierDeformer.h:

bool    m_bIsRTPS; //different from the RAS_MaterialBucket flag but used to set it.

As you can see from the comment, we will use a similar boolean in RAS_MaterialBucket.h to signify whether or not a RAS_MeshSlot has our system:

bool    m_bRTPS;
rtps::RTPS*     m_pRTPS;

The RAS_MeshSlot is where the real action is, these are what the game engine uses to keep track of objects, at least as far as my understanding is concerned for modifiers and rendering. Now that we can keep a pointer to our system in a MeshSlot, we can access it in several places where we need it. Lets start with RAS_MaterialBucket.cpp which of course has the constructor (line 63) and destructor (line 90) as well as foregoing the application of OpenGL transformations (we manage our own OpenGL stuff inside the library) on line 618.
As we saw earlier the MeshSlot’s m_bRTPS boolean is used to call the render function in RAS_ListRasterizer.cpp on lines 234 and 269.

More interestingly, let’s look at how the boolean and the pointer are set when the modifier is applied in BL_ModifierDeformer.cpp:

if(m_bIsRTPS)
{
    (*slot)->m_bRTPS = true;

    ModifierData* md;
    for (md = (ModifierData*)m_objMesh->modifiers.first; md; md = (ModifierData*)md->next)
    {
        if(md->type & eModifierType_RTPS)
        {
            RTPSModifierData* rtmd = (RTPSModifierData*)md;

            rtps::RTPSettings::SysType sys = (rtps::RTPSettings::SysType)rtmd->system;

            //get the bounding box of the object for use as domain bounds
            KX_GameObject* gobj = (KX_GameObject*)m_gameobj;
            MT_Point3 bbpts[8];
            gobj->GetSGNode()->getAABBox(bbpts);
            MT_Point3 min = bbpts[0];
            MT_Point3 max = bbpts[7];
            using namespace rtps;
            rtps::Domain grid(float4(min.x(), min.y(), min.z(), 0), float4(max.x(), max.y(), max.z(), 0));

            if (sys == rtps::RTPSettings::SPH)
            {
                rtps::RTPSettings settings(sys, rtmd->num, rtmd->dt, grid, rtmd->collision);
                (*slot)->m_pRTPS = new rtps::RTPS(settings);
            }

Where slot is defined in a for loop over the list of MeshSlots in the lines that precede this excerpt. First we set our boolean flag to true (383), then we loop over the modifier data on our object (it is conceivable that our object could have more than one modifier) until we find the RTPS modifier (388). So we get the modifier data (390) which contains the values input into the UI, starting with what type of system we are making. Then we get the Axis Aligned Bounding Box for our object to construct the domain for the simulation (395-401). Finally we construct our RTPS object and save the pointer to our slot (406). If you look at the actual code you will see that there are a couple other systems that are possible, one of which my colleague is using to bring superfast boids to the game engine, but I’ll let her write about them!

Modifier Update

So now that we’ve got a system created, we want to simulate some fluids, preferably every frame. So let’s turn our attention to the Update function of our modifier, still in BL_ModifierDeformer.cpp we again loop over the slots and find our system and make sure it’s pointer is valid

 if((*slot)->m_bRTPS && (*slot)->m_pRTPS)
{
    rtps::RTPS* rtps = (*slot)->m_pRTPS;

Here we make a convenience variable to refer to our system for the rest of the Update. Next we skip a few lines that aren’t being used right now, and move on to interacting with other objects in the game!

if(rtps->settings.system == rtps::RTPSettings::SPH)
{
    CListValue* oblist = kxs->GetObjectList();
    int num_objects = oblist->GetCount();

The first thing we need to do is make sure we are dealing with the right kind of system, other systems might want to interact differently. The real work starts when we get the list of objects from the game engine.
A quick note, if you look in the code you might see timers sprinkled around:

timers[TI_EMIT]->start();

these are defined in the rtps library and this one was setup in the ModifierDeformer constructor.
On the next line you will see a vector of Triangles:

std::vector triangles;

which we will use to store all of the triangles from the meshes of objects we wish to collide against.

So lets loop over the objects in the game and check for properties we care about:

for(int iob = 0; iob < num_objects; iob++)     {         KX_GameObject* gobj = (KX_GameObject*)oblist->GetValue(iob);
        STR_String name = gobj->GetName();
        //printf("obj: %s\n", name.Ptr());

So we have our game object gobj, and we can print out it’s name to give us an idea of whats going on (of course this will print out the name for all objects every frame, so you only want to do this while debugging and figuring things out)

Collisions

Now let’s see how we interact with collider objects:

//Check if object is a collider
bool collider = false;
CBoolValue* boolprop = (CBoolValue*)gobj->GetProperty("collider");
if(boolprop)
{
    //printf("obj: %s, collider: %d\n", name.Ptr(), boolprop->GetBool());
    collider = boolprop->GetBool();
    if(collider)
    {
        getTriangles(gobj, triangles);
    }
}

Here we get the boolean value of our collider game property, and if it is true we populate the triangles vector with the triangle faces of the game object by calling the getTriangles function:

int getTriangles(KX_GameObject* gobj, std::vector &triangles)
{
    //get the mesh and the derived mesh so we can get faces/verts
    RAS_MeshObject* meshobj = gobj->GetMesh(0);
    Mesh* mesh = meshobj->GetMesh();
    DerivedMesh *dm = CDDM_from_mesh(mesh, NULL);

    //for now we are assuming triangles
    int n_faces = dm->getNumFaces(dm);
    MFace* face = dm->getFaceArray(dm);

    MT_Matrix3x3 grot = gobj->NodeGetWorldOrientation();
    MT_Point3 gp = gobj->NodeGetWorldPosition();

    Triangle tri;
    MT_Vector3 fv, ftv;

    for(int i = 0; i < n_faces; i++)
    {
        getTriangle(face[i], grot, gp, dm, tri);
        triangles.push_back(tri);
    }
    return triangles.size();
}

This function loops over all of the faces in the derived mesh and gets the individual Triangle struct for each one. These triangles are in global coordinates (rather than local to the object) which we accomplish by applying the global transformation (rotation and translation) to them in the getTriangle function:

void getTriangle(MFace& face, MT_Matrix3x3& grot, MT_Point3& gp, DerivedMesh* dm, Triangle& tri)
{
    MT_Vector3 fv, ftv;
    float mv[3];

    dm->getVertCo(dm, face.v1, mv);
    //rotate and translate by global rotation/translation to get global coords
    fv = MT_Vector3(mv[0], mv[1], mv[2]);
    ftv = grot*fv + gp;
    tri.verts[0] = float4(ftv.x(), ftv.y(), ftv.z(), 1);

    dm->getVertCo(dm, face.v2, mv);
    fv = MT_Vector3(mv[0], mv[1], mv[2]);
    ftv = grot*fv + gp;
    tri.verts[1] = float4(ftv.x(), ftv.y(), ftv.z(), 1);

    dm->getVertCo(dm, face.v3, mv);
    fv = MT_Vector3(mv[0], mv[1], mv[2]);
    ftv = grot*fv + gp;
    tri.verts[2] = float4(ftv.x(), ftv.y(), ftv.z(), 1);

    float4 e1 = float4(tri.verts[1].x - tri.verts[0].x, tri.verts[1].y - tri.verts[0].y, tri.verts[1].z - tri.verts[0].z, 0);
    float4 e2 = float4(tri.verts[2].x - tri.verts[0].x, tri.verts[2].y - tri.verts[0].y, tri.verts[2].z - tri.verts[0].z, 0);
    tri.normal = normalize(cross(e1, e2));
}

Here we simply get the vertex coordinates of the face and transform them with the rotation matrix and translation vector. This is nice and easy because the moto library provides operator overloading for doing matrix and vector operations. Finally we compute the normal using normalize and cross functions defined in the RTPS library.

For the collisions to happen we need to tell our system about it, which we do after we are done with looping over objects and right before the update call:

if(triangles.size() > 0 && rtps->settings.tri_collision)
{
    rtps->system->loadTriangles(triangles);
}

Emitters

We jumped ahead a little bit, before we get to the update call and we are done with checking for the collider property, we do emitters by checking if an object has a num property:

//Check if object is an emitter
//for now we are just doing boxes
CIntValue* intprop = (CIntValue*)gobj->GetProperty("num");
if(intprop)
{
    //get the number of particles in this emitter
    int num = (int)intprop->GetInt();
    if (num == 0) { continue;} //out of particles

    int nn = makeEmitter(num, gobj);
    if( nn == 0) {continue;}
}

After we get the value of num we first we make sure we aren’t trying to emit 0 particles. The makeEmitter function is defined on line 435 but has a bunch of extra dev stuff, with the important part being the following lines:

nn = num;
CIntValue *numprop = new CIntValue(0);
gobj->SetProperty("num", numprop);
return nn;

Which simply sets the num property to 0 and returns the original value. This is more complicated than it needs to be because the makeEmitter function will add more capabilities in the future such as turning an object into a “hose” to spray particles. This effect can be simulated for now by setting the num property to the number of particles you would like to emit over and over again (because every frame that the num property has a value greater than 0, that many particles will be emitted and then it will be set back to 0).
Finally we get the object’s bounding box in order to call the API function for adding a box of particles to our system.

    MT_Point3 bbpts[8];
    gobj->GetSGNode()->getAABBox(bbpts);
    float4 min = float4(bbpts[0].x(), bbpts[0].y(), bbpts[0].z(), 0);
    float4 max = float4(bbpts[7].x(), bbpts[7].y(), bbpts[7].z(), 0);
    rtps->system->addBox(nn, min, max, false); 

Conclusion

Whew, so I’ve taken you through just about all the changes I’ve made to the Blender source code. The one thing I haven’t really touched on is the includes I added in BL_ModifierDeformer.cpp (lines 59-71) some of which I am hoping for insight on, because it may not be good to include a KX_* header in a BL_* file. Other than that I added #include “RTPS.h” or #include “timege.h” to some of the various files I’ve gone over. The MODIFIED_FILES document should cover all the files I’ve changed, including CMake related changes. I try to keep my git mirror up-to-date with the blender git mirror so one could also use git or svn tools to see my exact changes.

Again, I hope this detailed breakdown will be helpful to someone! I’d like to thank Moguri, jbakker and dfelinto from the Blender community for their invaluable help so far. As the material from this post will most likely end up in my Master’s thesis, I’d like to shout out to my advisor Gordon Erlebacher my colleague Evan Bollig (for all kinds of CMake and programming help) and the Department of Scientific Computing!

If you made it this far, we got love for the same thing, so holler when you see me on IRC (enjalot) or gtalk :D

Posted in blender, code | 2 Comments

Adventures in PyOpenCL: Part 1 Getting Started with Python

Hello, World! I’m a big fan of Python (and excited to be attending PyCon!), and I’ve recently started playing with a great module called PyOpenCL. I’ve ported my Part 1 (and Part 1.5) tutorials from C (and C++) to Python using PyOpenCL and things are way simpler as you will see shortly :)

This code runs on my Macbook Pro with a NVIDIA 9600GT as well as on our Dell systems running Ubuntu, one with an ATI FirePro v7800 and one with an NVIDIA GTX480. We are working on getting a Windows machine running a decent GPU in our lab, but any feedback from Windows users would be appreciated!

Let’s Get Started!

You will need to have:

For this tutorial we only have two files! If you look in your advcl folder

python/part1

you will see only two files: main.py and part1.cl, already lookin’ a lot better than C/C++

So let’s give it a shot, just run

python main.py

and you should see

a [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.]
b [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.]
c [  0.   2.   4.   6.   8.  10.  12.  14.  16.  18.]

Yay we’ve just added two tiny arrays with PyOpenCL, let’s see how.
First we will look at the OpenCL kernel: part1.cl (exactly the same code as Part 1 and Part 1.5)

__kernel void part1(__global float* a, __global float* b, __global float* c)
{
    unsigned int i = get_global_id(0);
    c[i] = a[i] + b[i];
}

The kernel is the work that is done by each of the workers in OpenCL. In a GPU these workers are called threads, they are executed in batches called workgroups. So we want to make what would normally be a for loop parallel, so we need to split it up into little pieces of work that can be done at the same time. In this case it is pretty straight forward, we simply make each element of the arrays one piece of data, and make one worker for each of the elements in the output array. OpenCL will then do this bit of work (adding two numbers) with each worker. The way we get it to access and store the right pieces of data is by using the index of the worker (get_global_id) as the index for the data in the arrays.

In order for OpenCL to do the work, we need to give it the data and tell it to execute. Lets take a look at how we do that in main.py

You can see we’ve defined a class, CL which will hold all of our OpenCL related variables. I split the setup and execution into four functions like in my last tutorial:

example = CL()
example.loadProgram("part1.cl")
example.popCorn()
example.execute()

First we construct the class, which sets up our OpenCL context as well as the command queue.

self.ctx = cl.create_some_context()
self.queue = cl.CommandQueue(self.ctx)

Right now we are using convenient PyOpenCL abstractions which will work well for getting started. In my next tutorial I will go over what to do here for OpenGL context sharing, and in the future we can talk about using multiple devices.

Next we load our program in from a file, with all the ease and grace of Python

f = open(filename, 'r')
fstr = "".join(f.readlines())
self.program = cl.Program(self.ctx, fstr).build()

Notice how we call the build() function on the program object and then store it as a variable. This is what we will use to execute our kernel

But first we need to prepare the data we want to work on, so we turn to the trusty NumPy module

self.a = numpy.array(range(10), dtype=numpy.float32)
self.b = numpy.array(range(10), dtype=numpy.float32)

We’ve just initialized two numpy arrays with values from 0 to 9, of course you are going to want to setup your arrays with real data, just make sure that you convert it to numpy arrays with the correct data type before you are ready for OpenCL.

mf = cl.mem_flags
self.a_buf = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.a)
self.b_buf = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.b)
self.dest_buf = cl.Buffer(self.ctx, mf.WRITE_ONLY, self.b.nbytes)

Here we create two OpenCL Buffers where we pass in data to be copied to the device right away. We also create a “destination” buffer which we will use to store the results of our computation. For more information on mem_flags see this forum post.

These buffers are now ready to be used by the kernel, so lets see how we execute it:

self.program.part1(self.queue, self.a.shape, None, self.a_buf, self.b_buf, self.dest_buf)

This is one of the sweet things about python, a method has been added to our program instance with the name of our kernel! So now we call it just like any other function, passing in our command queue, the global and local worksizes (in this case our global size is the size of our arrays, and we don’t specify a local worksize, leaving it up to the implementation). We then pass in the three parameters to our kernel, the three OpenCL Buffers we created.

Finally we want to look at the results of our computation so we need to read back the data from dest_buf and print it out:

c = numpy.empty_like(self.a)
cl.enqueue_read_buffer(self.queue, self.dest_buf, c).wait()
print "a", self.a
print "b", self.b
print "c", c

We read data from the destination buffer into our c array which is an empy numpy array of the correct size and type. Notice the wait() on the end of enqueue_read_buffer, we could have also put self.queue.finish() on the next line instead, to make sure OpenCL was done copying the data before we tried to print it out.

So there we go! As you can see the code is much simpler than in C/C++ and if you are doing most of your work on the GPU, there won’t be a significant difference in performance between Python and the other languages. The ease of writing in Python has me very excited about prototyping other OpenCL programs with PyOpenCL, and I hope others enjoy it as well!

Posted in advcl, opencl, tutorial | 12 Comments

class useless(xkcd):


Happy Valentines Day!

with all due respect to the original classic

Posted in life | 3 Comments