Hello, World! I’m a big fan of Python (and excited to be attending PyCon!), and I’ve recently started playing with a great module called PyOpenCL. I’ve ported my Part 1 (and Part 1.5) tutorials from C (and C++) to Python using PyOpenCL and things are way simpler as you will see shortly :)
This code runs on my MacBook Pro with an NVIDIA 9600GT, as well as on our Dell systems running Ubuntu, one with an ATI FirePro V7800 and one with an NVIDIA GTX 480. We are working on getting a Windows machine with a decent GPU running in our lab, but any feedback from Windows users would be appreciated!
Let’s Get Started!
You will need to have:
- numpy (matplotlib and scipy aren't a bad idea either); there are separate install instructions for Mac users
- PyOpenCL and its documentation
- if you want GL interop: PyOpenGL (used in the Part 2 tutorial, coming soon)
- Mac users will need to build PyOpenCL from source
- The code!
- The OpenCL specification and other resources from the Khronos registry
For this tutorial we only need two files! If you look in your advcl folder under
python/part1
you will see just two files: main.py and part1.cl, already lookin' a lot better than C/C++.
So let’s give it a shot, just run
python main.py
and you should see
a [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
b [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
c [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18.]
Yay we’ve just added two tiny arrays with PyOpenCL, let’s see how.
First we will look at the OpenCL kernel: part1.cl (exactly the same code as in Part 1 and Part 1.5).
__kernel void part1(__global float* a, __global float* b, __global float* c)
{
    unsigned int i = get_global_id(0);
    c[i] = a[i] + b[i];
}
The kernel is the work done by each worker in OpenCL. On a GPU these workers are called threads, and they are executed in batches called workgroups. We want to parallelize what would normally be a for loop, so we split it up into little pieces of work that can be done at the same time. In this case it is pretty straightforward: we make each element of the arrays one piece of data and create one worker for each element of the output array. OpenCL then does this bit of work (adding two numbers) with each worker. Each worker accesses and stores the right pieces of data by using its own index (get_global_id) as the index into the arrays.
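If it helps to see the mapping, here is the plain serial loop the kernel replaces (just illustrative Python, not part of main.py); each iteration becomes one worker, and the loop index becomes get_global_id(0):

# Serial version of what the part1 kernel does in parallel;
# each iteration of this loop corresponds to one OpenCL worker.
a = [0.0, 1.0, 2.0, 3.0]
b = [0.0, 1.0, 2.0, 3.0]
c = [0.0] * len(a)
for i in range(len(a)):  # in the kernel, i = get_global_id(0)
    c[i] = a[i] + b[i]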
In order for OpenCL to do the work, we need to give it the data and tell it to execute. Let's take a look at how we do that in main.py.
You can see we've defined a class, CL, which will hold all of our OpenCL-related variables. I split the setup and execution into four functions, as in my last tutorial:
example = CL()
example.loadProgram("part1.cl")
example.popCorn()
example.execute()
First we construct the class, which sets up our OpenCL context as well as the command queue.
self.ctx = cl.create_some_context()
self.queue = cl.CommandQueue(self.ctx)
Right now we are using convenient PyOpenCL abstractions which will work well for getting started. In my next tutorial I will go over what to do here for OpenGL context sharing, and in the future we can talk about using multiple devices.
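To give a taste of what the explicit route looks like, here is a minimal sketch of picking a platform and device by hand instead of using create_some_context() (the indices are assumptions; adjust them for your machine):

import pyopencl as cl

# Enumerate the platforms (e.g. Apple, NVIDIA, AMD), grab the GPUs on the
# first one, and build a context and command queue on a specific device.
platforms = cl.get_platforms()
gpus = platforms[0].get_devices(device_type=cl.device_type.GPU)
ctx = cl.Context(devices=[gpus[0]])
queue = cl.CommandQueue(ctx)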
Next we load our program in from a file, with all the ease and grace of Python:
f = open(filename, 'r')
fstr = "".join(f.readlines())
self.program = cl.Program(self.ctx, fstr).build()
Notice how we call the build() function on the program object and then store the result as a variable. This is what we will use to execute our kernel.
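For what it's worth, the same load can also be written with a with block so Python closes the file for us; just an alternative sketch, same behavior:

def loadProgram(self, filename):
    # Read the whole kernel source, then compile it for our context.
    with open(filename, 'r') as f:
        fstr = f.read()
    self.program = cl.Program(self.ctx, fstr).build()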
But first we need to prepare the data we want to work on, so we turn to the trusty NumPy module:
self.a = numpy.array(range(10), dtype=numpy.float32)
self.b = numpy.array(range(10), dtype=numpy.float32)
We've just initialized two numpy arrays with values from 0 to 9. Of course you are going to want to set up your arrays with real data; just make sure you convert them to numpy arrays with the correct data type before handing them to OpenCL.
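For example, if your real data starts life as a plain Python list or a default float64 numpy array, the conversion might look like this (the data here is made up; the dtype handling is the part that matters):

import numpy

measurements = [1.5, 2.25, 3.75, 4.0]  # pretend this came from elsewhere in your program
a = numpy.array(measurements, dtype=numpy.float32)

samples = numpy.linspace(0.0, 1.0, 10)  # numpy defaults to float64...
b = samples.astype(numpy.float32)       # ...so cast before handing it to OpenCL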
mf = cl.mem_flags
self.a_buf = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.a)
self.b_buf = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.b)
self.dest_buf = cl.Buffer(self.ctx, mf.WRITE_ONLY, self.b.nbytes)
Here we create two OpenCL Buffers whose data is copied to the device right away. We also create a "destination" buffer, which we will use to store the results of our computation. For more information on mem_flags, see this forum post.
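As a quick sketch of a couple of other common flag combinations (our kernel only needs the ones above):

mf = cl.mem_flags

# A buffer the kernel can both read and write, initialized from host data:
rw_buf = cl.Buffer(self.ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=self.a)

# An uninitialized buffer of a given size in bytes, written entirely on the device:
scratch_buf = cl.Buffer(self.ctx, mf.READ_WRITE, self.a.nbytes)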
These buffers are now ready to be used by the kernel, so let's see how we execute it:
self.program.part1(self.queue, self.a.shape, None, self.a_buf, self.b_buf, self.dest_buf)
This is one of the sweet things about Python: a method has been added to our program instance with the name of our kernel! So now we call it just like any other function, passing in our command queue, the global and local worksizes (in this case our global size is the size of our arrays, and we don't specify a local worksize, leaving it up to the implementation), and then the three parameters to our kernel: the three OpenCL Buffers we created.
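If the generated method feels too magical, the same call can be made through an explicit kernel object; here is a sketch (the commented-out line shows pinning a local worksize, assuming a global size it divides evenly):

# Equivalent to self.program.part1(...): look the kernel up by name and call it.
kernel = cl.Kernel(self.program, "part1")
kernel(self.queue, self.a.shape, None, self.a_buf, self.b_buf, self.dest_buf)

# With, say, 512 elements we could also pin the workgroup size:
# kernel(self.queue, (512,), (64,), self.a_buf, self.b_buf, self.dest_buf)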
Finally we want to look at the results of our computation so we need to read back the data from dest_buf and print it out:
c = numpy.empty_like(self.a)
cl.enqueue_read_buffer(self.queue, self.dest_buf, c).wait()
print "a", self.a
print "b", self.b
print "c", c
We read data from the destination buffer into our c array, which is an empty numpy array of the correct size and type. Notice the wait() on the end of enqueue_read_buffer; we could instead have put self.queue.finish() on the next line, to make sure OpenCL was done copying the data before we tried to print it out.
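That alternative would look like this; same result, just separating the copy from the synchronization:

c = numpy.empty_like(self.a)
cl.enqueue_read_buffer(self.queue, self.dest_buf, c)  # enqueues the copy
self.queue.finish()  # block until everything on the queue is done
print "c", c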
So there we go! As you can see, the code is much simpler than in C/C++, and if you are doing most of your work on the GPU, there won't be a significant difference in performance between Python and the other languages. The ease of writing in Python has me very excited about prototyping other OpenCL programs with PyOpenCL, and I hope others enjoy it as well!
This was great. I easily followed it. Waiting for more!
Great! This is exactly what I’ve been looking for! Please continue with this tutorial with more implementations suited for PyOpenCL, I’m sure a lot of people will find it useful, as there currently is a missing stepping stone in learning OpenCL through Python and PyOpenCL.
Thank you!
Woah, that was really simple. I was looking for an easy way to get started with OpenCL, and this is it! Thanks!
Brilliant tutorial, please put up more! I am really excited about using OpenCL with Python; it seems like the perfect combination, and your tutorial is really helpful. Please please continue with Part 1.5 (and the rest!)
Thank you
Great tutorial!
easy_install works fine on Mac for PyOpenCL.
The Enthought Python distro is an easy way to install a stack of packages (numpy, matplotlib, etc.); no OpenCL in there yet.
Would love to get this working on a Gumstix Overo or a PandaBoard.
It's a shame there's no Python on iOS.
Hi:
Thanks for the great tutorial. This is a great toolkit and very easy to use. The only problem I have is that using PyOpenCL is *slower* than numpy! I've been struggling for the last few hours to find a combination of data size/kernel that would show the power of OpenCL. So far no luck...
Any thoughts? (Btw I’m using pyopencl on a MacBook Pro 2.2GHz, running python2.6, and the latest version of pyopencl).
Thanks,
David.
Hi Again:
I think I figured out the problem (thanks to a nice benchmark program I found online: http://git.tiker.net/pyopencl.git/blob_plain/HEAD:/examples/benchmark-all.py). Here are the results.
Thanks again for the great work. I’m impressed at how easy it was to get this up and running. Nice job!
Cheers,
David.
('Execution time of test without OpenCL: ', 9.6666891574859619, 's')
===============================================================
('Platform name:', 'Apple')
('Platform profile:', 'FULL_PROFILE')
('Platform vendor:', 'Apple')
('Platform version:', 'OpenCL 1.0 (Dec 26 2010 12:52:21)')
---------------------------------------------------------------
('Device name:', 'ATI Radeon HD 6750M')
('Device type:', 'GPU')
('Device memory: ', 512L, 'MB')
('Device max clock speed:', 150, 'MHz')
('Device compute units:', 5)
Execution time of test: 0.0126587 s
Results OK
===============================================================
('Platform name:', 'Apple')
('Platform profile:', 'FULL_PROFILE')
('Platform vendor:', 'Apple')
('Platform version:', 'OpenCL 1.0 (Dec 26 2010 12:52:21)')
---------------------------------------------------------------
('Device name:', 'Intel(R) Core(TM) i7-2720QM CPU @ 2.20GHz')
('Device type:', 'CPU')
('Device memory: ', 6144L, 'MB')
('Device max clock speed:', 2200, 'MHz')
('Device compute units:', 8)
Execution time of test: 0.00191466 s
Results OK
@Al – thanks! Installing with setuptools (easy_install) will work, but that note is for Mac users who want GL interop (which is only available in the latest source for now).
@David – thanks for that link. I've been planning a tutorial about timing and what kinds of things benefit from OpenCL. I'll cover it after I write up my Part 2 (OpenGL particles).
The benchmark you mention is so wrong.
You have to compare to numpy operations, not pure Python on the CPU.
With numpy you can write directly
c_result = (a + b) * (a + b) * (a / 2)
My results:
('Execution time of test without OpenCL: ', 11.889606952667236, 's')
('Execution time of test within numpy ', 0.00010013580322265625, 's')
===============================================================
('Platform name:', 'NVIDIA CUDA')
('Platform profile:', 'FULL_PROFILE')
('Platform vendor:', 'NVIDIA Corporation')
('Platform version:', 'OpenCL 1.1 CUDA 4.0.1')
---------------------------------------------------------------
('Device name:', 'Quadro NVS 290')
('Device type:', 'GPU')
('Device memory: ', 255L, 'MB')
('Device max clock speed:', 918, 'MHz')
('Device compute units:', 2)
Execution time of test: 0.0360267 s
Results OK
pm,
You need to run the test 1000 times. Like:
time1 = time()
for i in range(1000):
    c_result = (a + b) * (a + b) * (a / 2.0)
time2 = time()
print("Execution time with numpy: ", time2 - time1, "s")
You will see that OpenCL performs better than NumPy (but NumPy is still pretty good, IMHO). On my machine:
('Execution time of test without OpenCL: ', 9.549367904663086, 's')
('Execution time with numpy: ', 0.01725006103515625, 's')
===============================================================
('Platform name:', 'Intel(R) OpenCL')
('Platform profile:', 'FULL_PROFILE')
('Platform vendor:', 'Intel(R) Corporation')
('Platform version:', 'OpenCL 1.1 LINUX')
---------------------------------------------------------------
('Device name:', ' Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz')
('Device type:', 'CPU')
('Device memory: ', 5864, 'MB')
('Device max clock speed:', 2000, 'MHz')
('Device compute units:', 8)
Execution time of test: 0.000382 s
Results OK
===============================================================
('Platform name:', 'NVIDIA CUDA')
('Platform profile:', 'FULL_PROFILE')
('Platform vendor:', 'NVIDIA Corporation')
('Platform version:', 'OpenCL 1.1 CUDA 4.2.1')
---------------------------------------------------------------
('Device name:', 'GeForce GT 540M')
('Device type:', 'GPU')
('Device memory: ', 2047, 'MB')
('Device max clock speed:', 1344, 'MHz')
('Device compute units:', 2)
Execution time of test: 0.000624512 s
Results OK