I’m trying to push the limits of interactive scientific visualization, so I gotta start somewhere! Ok, so 20 million particles running at ~40fps is not a bad start, but this is also using the simplest rendering possible (GL_POINTS of size 1, with no alpha blending or lighting).
I’m running my code on Ubuntu 10.04 on a Dell Precision T7500 with an NVIDIA GTX480 (1.5GB GDDR5). The machine has 12GB of RAM and 8 CPU cores, which don’t get used at all by this (other than to initialize the system and run the main loop, of course). Check out the video to see some pretty patterns and a large number of particles!
A quick note for those waiting for these in Blender: currently I have it working, but only with about 100k particles if I use a Mesh as the emitter (any more and Blender starts choking before I get into the game engine). I’m going to start looking at the existing particle interface (as well as the redesign) so I can integrate more tightly into Blender now that I’m getting comfortable with OpenCL/OpenGL.
Let’s talk some math and look at some numbers. If that doesn’t sound fun, it’s ok to stop reading now ;)
So first let me admit that this OpenCL kernel is not doing that much work. I’m solving the Lorenz Attractor ODEs with the RK4 method for each particle. It’s all embarrassingly parallel and there is no interaction between the particles. Second, I tuned down the rendering so it’s doing as little as possible. This lets us get to the memory limit of the GTX480 at 20 million particles with my setup. Let’s see why:
I use float4 arrays (OpenCL variables of 4 32bit floats in a row) for vertices, colors, generators and velocities. I use one float array to keep track of the life of each particle.
4 (arrays) x 20,000,000 (particles) x 4 (floats) x 4 (bytes) = 1,280,000,000 bytes
1 (array) x 20,000,000 (particles) x 1 (float) x 4 (bytes) = 80,000,000 bytes
So that’s 1,360,000,000 bytes / 1024^3 = 1.267GB of memory on the graphics card!
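The arithmetic above can be double-checked with a few lines of Python. This is only a sketch of the bookkeeping: the function name and its parameters are made up for illustration, and it counts just the five particle arrays described in the post, ignoring anything else the driver or OpenGL may allocate on the card.

```python
FLOAT_BYTES = 4  # one 32-bit float


def particle_buffer_bytes(n_particles, n_float4_arrays=4, n_float_arrays=1):
    """Bytes needed for the particle arrays described in the post.

    Hypothetical helper: 4 float4 arrays (vertices, colors, generators,
    velocities) plus 1 plain float array (particle life).
    """
    float4_bytes = n_float4_arrays * n_particles * 4 * FLOAT_BYTES
    float_bytes = n_float_arrays * n_particles * FLOAT_BYTES
    return float4_bytes + float_bytes


total = particle_buffer_bytes(20_000_000)
per_particle = particle_buffer_bytes(1)

print(total)                        # 1360000000 bytes
print(round(total / 1024**3, 3))    # 1.267 (bytes divided by 1024^3)
print(per_particle)                 # 68 bytes per particle
```

At 68 bytes per particle, each extra per-particle array (an index array for depth sorting, for example) eats directly into how many particles fit on a 1.5GB card.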
Luckily I’m using OpenGL interop, so the vertex and color arrays are actually VBOs in OpenGL’s context and are modified in place. Since that’s the case, we don’t transfer any of that memory back to the CPU, which would be a serious problem. It’s possible my kernel could be optimized more, but right now memory and rendering are the limiting factors. When I start implementing more interesting physics like collision detection and fluid dynamics, this will be a bigger issue. I’m also planning to implement depth sorting using an index array VBO so I can render cool effects with proper alpha blending. This of course will also limit the number of particles possible, with my guess being that rendering will be the biggest culprit (not memory).
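For anyone curious what the per-particle work mentioned earlier looks like, here is a rough CPU-side sketch in plain Python of one RK4 step on the Lorenz equations. It is not the actual OpenCL kernel: the parameter values (sigma = 10, rho = 28, beta = 8/3) are the classic textbook choice and the time step of 0.01 is a guess, since the post doesn’t state either, and the function names are made up for illustration.

```python
# Classic Lorenz parameters (an assumption; the post does not give values).
SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0


def lorenz(p):
    """Right-hand side of the Lorenz ODEs for one particle position."""
    x, y, z = p
    return (SIGMA * (y - x), x * (RHO - z) - y, x * y - BETA * z)


def offset(p, k, h):
    """Return p + h * k, component by component."""
    return tuple(pi + h * ki for pi, ki in zip(p, k))


def rk4_step(p, dt):
    """Advance one particle by dt using the classical 4th-order Runge-Kutta method."""
    k1 = lorenz(p)
    k2 = lorenz(offset(p, k1, dt / 2))
    k3 = lorenz(offset(p, k2, dt / 2))
    k4 = lorenz(offset(p, k3, dt))
    return tuple(
        pi + dt / 6 * (a + 2 * b + 2 * c + d)
        for pi, a, b, c, d in zip(p, k1, k2, k3, k4)
    )


def step_all(particles, dt=0.01):
    """Each particle is updated independently, so there is no interaction."""
    return [rk4_step(p, dt) for p in particles]


particles = [(1.0, 1.0, 1.0), (-5.0, 2.0, 20.0)]
particles = step_all(particles)
```

Because `rk4_step` only ever reads its own particle, the loop in `step_all` maps directly onto one GPU work-item per particle, which is why the problem is embarrassingly parallel.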

WOW! just WOW!
Cool stuff :) A particle engine for BGE is welcome indeed. I wish you good luck with your project.
Hi, awesome test. Have you considered the Nvidia Tesla C1060 card? What do you think about GPUs: which one has better performance, the GTX480 or the Tesla C1060? I’m working on swarm-based algorithms implemented with CUDA, but for the time being I only have a Geforce4900 for testing.
Best regards!
The effect at 3:20 of the video is especially cool.
@Enj so you’re converting C++ to OpenCL. I’ve been looking to see whether something already exists to convert C++ to CUDA or OpenCL. But there is the opposite, GPUocelot: Ocelot currently allows CUDA programs to be executed on NVIDIA GPUs and x86 CPUs at full speed without recompilation. http://code.google.com/p/gpuocelot/ Not sure if that would be of use to you.