LBL Summer Research: Report 2

So I’ve been up in the Berkeley hills for a few weeks now, coding by day and hacking by night. I’ve made some progress at work, which I’ll detail here; it includes some things I’ve been wanting to do for my RTPS project (more on that in a post coming soon!).

[Image from the LBNL FTLE album]

A good bit of my time was spent cleaning up my code, getting everything into classes that are (relatively) simple to interface with, so that next week I can integrate with a coworker’s MPI framework for communication. I also finally ported NVIDIA’s OpenCL radix sort to use my C++ wrappers and added key-value sorting (their demo only sorted keys). This should mean a nice little speedup for RTPS when I plug it in!
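If “key-value sort” sounds abstract: I need to sort the keys (say, spatial hashes) while dragging each key’s value (say, a particle index) along with it. Here’s a little CPU-side sketch of just that behavior; the real thing is NVIDIA’s radix sort running on the GPU through my wrappers, and the names here are mine, purely for illustration:

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// CPU-side sketch of what a key-value sort gives you: the keys end up in
// ascending order and each value stays attached to its original key.
// (The GPU version is a radix sort over OpenCL buffers; this only shows
// the semantics, not the implementation.)
void sort_by_key(std::vector<uint32_t>& keys, std::vector<uint32_t>& values)
{
    std::vector<size_t> perm(keys.size());
    std::iota(perm.begin(), perm.end(), 0);
    std::sort(perm.begin(), perm.end(),
              [&](size_t a, size_t b) { return keys[a] < keys[b]; });

    std::vector<uint32_t> k(keys.size()), v(values.size());
    for (size_t i = 0; i < perm.size(); ++i) {
        k[i] = keys[perm[i]];
        v[i] = values[perm[i]];
    }
    keys.swap(k);
    values.swap(v);
}
```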

In my last report I had already implemented particle advection and FTLE computation in OpenCL for one GPU. Since then I’ve been developing my code so that each GPU (let’s call it a process) loads a subset of the domain and advects its own set of “seed” particles in that subset. The major issue here: what happens when a particle leaves the process’s portion of the domain (let’s call it the subdomain)? Each particle needs to be advected for the same amount of time, and we need to advect them over the entire domain. So when a particle crosses the boundary from one subdomain to another, we need to send it over to the other process so it can continue being advected.
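To make that concrete, here’s a rough CPU-side sketch of the per-step check. The struct names and the forward-Euler step are just for illustration; the actual kernels interpolate the velocity field and use a proper integrator:

```cpp
#include <array>

struct Particle {
    std::array<float, 3> pos;
    float time_left;       // remaining advection time
    unsigned orig_index;   // where it came from, for the FTLE step later
};

struct Subdomain {
    std::array<float, 3> min, max;
    bool contains(const std::array<float, 3>& p) const {
        for (int d = 0; d < 3; ++d)
            if (p[d] < min[d] || p[d] >= max[d]) return false;
        return true;
    }
};

// Illustrative advection step (forward Euler for brevity). Returns true if
// the particle stepped outside this process's subdomain and needs to be
// shipped to another process.
template <typename VelocityField>
bool advect_step(Particle& p, const Subdomain& sub,
                 const VelocityField& vel, float dt)
{
    auto v = vel(p.pos);   // assumed to return something indexable with [0..2]
    for (int d = 0; d < 3; ++d) p.pos[d] += dt * v[d];
    p.time_left -= dt;
    return !sub.contains(p.pos);
}
```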

[Image from the LBNL FTLE album]

Additionally, once a particle is done being advected, it needs to go back to its original process so that the FTLE can be computed (each process will do FTLE calculations using its original set of seed particles).
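For reference, the quantity each process ends up computing from its seed particles is the usual finite-time Lyapunov exponent, which is why every seed’s end position has to come home: the flow-map gradient is estimated by finite differences over neighboring seeds. Roughly:

```latex
% Standard FTLE at a seed point x over integration time T.
% \phi is the flow map (where the seed ends up); its gradient is approximated
% by finite differences over neighboring seed particles.
\sigma_{t_0}^{T}(x) \;=\; \frac{1}{|T|}\,
  \ln \sqrt{\lambda_{\max}\!\left(
    \left(\frac{d\phi_{t_0}^{\,t_0+T}(x)}{dx}\right)^{\!\top}
    \frac{d\phi_{t_0}^{\,t_0+T}(x)}{dx}
  \right)}
```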

What I’ve implemented so far is a relatively efficient way of determining, on the GPU, which particles need to go to which subdomains. This is important because copying memory from the GPU to the CPU is expensive (it takes a while), and since we want to run a whole helluva lot of particles, we don’t want to burden the CPU with checking which particle goes where. If we can do all of our bookkeeping on the GPU (and in parallel), then we only need to copy the particles we actually want to send to another process.

I do this with a technique called spatial hashing: you overlay a (regular) grid onto your domain, and each cell of this grid gets a one-dimensional index, which will be our hash. The simplest way to compute it is with Cartesian indexing, a topic I intend to make a tutorial post on.
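Here’s roughly what that hash looks like, with plain C++ standing in for the OpenCL kernel (the names and grid layout are illustrative, not copied from my code):

```cpp
#include <cmath>

// Cartesian hash: overlay an nx-by-ny-by-nz grid of subdomains on the full
// domain and flatten the (i, j, k) cell coordinates into one index.
struct HashGrid {
    float origin[3];
    float cell_size[3];   // size of one subdomain along each axis
    int   dims[3];        // nx, ny, nz

    int hash(const float pos[3]) const {
        int c[3];
        for (int d = 0; d < 3; ++d) {
            c[d] = static_cast<int>(std::floor((pos[d] - origin[d]) / cell_size[d]));
            if (c[d] < 0 || c[d] >= dims[d])
                return dims[0] * dims[1] * dims[2];   // the extra "out of bounds" hash, described below
        }
        // Cartesian (row-major) flattening: i + j*nx + k*nx*ny
        return c[0] + c[1] * dims[0] + c[2] * dims[0] * dims[1];
    }
};
```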

So if each of our subdomains is one of these big grid cells, we can go through and calculate a hash for each of the particles. Then we sort our particles based on their hash values, which puts them into contiguous (next to each other) blocks by subdomain. Next we run a quick kernel that tells us the start and end indices into the particle array for each subdomain (so if the first 150 particles are in cell 0, the start index would be 0 and the end index 150; or we might have start index 300 and end index 1000 for cell 7). Now we can quickly check how many particles moved into each subdomain and copy them out directly using the start and end indices.
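A serial sketch of that start/end kernel, just to show the idea (on the GPU each particle compares its hash against its neighbor’s in parallel; the names here are made up):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// After sorting particles by hash, find where each hash value's block starts
// and ends in the particle array. cell_start and cell_end must already be
// sized to the number of possible hash values; end indices are exclusive,
// and -1 marks an empty cell.
void find_cell_bounds(const std::vector<uint32_t>& sorted_hashes,
                      std::vector<int>& cell_start,
                      std::vector<int>& cell_end)
{
    std::fill(cell_start.begin(), cell_start.end(), -1);
    std::fill(cell_end.begin(), cell_end.end(), -1);

    for (size_t i = 0; i < sorted_hashes.size(); ++i) {
        uint32_t h = sorted_hashes[i];
        if (i == 0 || h != sorted_hashes[i - 1])
            cell_start[h] = static_cast<int>(i);
        if (i + 1 == sorted_hashes.size() || h != sorted_hashes[i + 1])
            cell_end[h] = static_cast<int>(i) + 1;
    }
}
```

The number of particles headed for subdomain c is then just cell_end[c] - cell_start[c], which is exactly what each process needs to know before copying and sending.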

Besides moving into other subdomains, particles can also go totally out of bounds, which means they won’t go anywhere for the rest of the simulation. We can consider them finished, and speaking of finished, we also need a way to deal with particles that have run their course. So the same hashing routine that labels the particles by their subdomain also sticks any finished or completely out-of-bounds particles in an extra hash (it’s just one more than the total number of subdomains), which our cell-indices kernel will recognize. Then we copy those finished particles into a new buffer and sort it by each particle’s original index (a value we keep with each particle). This way we can send finished particles back to their original processes for the FTLE computations.
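A sketch of that last step, reusing the illustrative Particle struct from earlier (again, this is the serial version of the idea, not the GPU code):

```cpp
#include <algorithm>
#include <vector>

// "Finished" particles (done advecting, or gone completely out of bounds)
// all land in the one extra hash bucket, so the same cell_start/cell_end
// trick hands us their contiguous block. Before shipping them home we sort
// that block by each particle's original index.
std::vector<Particle> collect_finished(const std::vector<Particle>& sorted_particles,
                                       int finished_start, int finished_end)
{
    std::vector<Particle> finished(sorted_particles.begin() + finished_start,
                                   sorted_particles.begin() + finished_end);
    std::sort(finished.begin(), finished.end(),
              [](const Particle& a, const Particle& b) {
                  return a.orig_index < b.orig_index;
              });
    return finished;
}
```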

Next week we start sending the particles around and stomping out the inevitable bugs in my design. It’s parallel in parallel, baby!
