GPU code
#51
Which directories do I have to point "OPENCL_INCLUDE_DIR" and "OPENCL_LIBRARY_PATH" to?
#52
OPENCL_INCLUDE_DIR is the location of the OpenCL headers. It will be somewhere in the CUDA toolkit program files and contains a folder named CL, which holds the various header files, at a minimum cl.hpp. Look for folders named include. OPENCL_LIBRARY_PATH is the location of the OpenCL.lib file. Again, it should be somewhere in the CUDA toolkit program files, probably in a folder called lib.
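If it helps, here is a minimal sketch of what ends up using those paths (nothing project-specific, just the stock OpenCL entry point): OPENCL_INCLUDE_DIR has to be the folder that contains the CL directory, not the CL directory itself, and OPENCL_LIBRARY_PATH is where the linker finds OpenCL.lib.
Code:
// OPENCL_INCLUDE_DIR must be the parent of the CL folder,
// so that these includes resolve:
#include <CL/cl.h>    // the C API
#include <CL/cl.hpp>  // the C++ wrapper

int main()
{
	// OpenCL.lib from OPENCL_LIBRARY_PATH is needed at link time
	// to resolve API calls such as this one:
	cl_uint numPlatforms = 0;
	clGetPlatformIDs(0, nullptr, &numPlatforms);
	return 0;
}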
#53
As I said, there is no "cl.hpp"
#54
Okay, I've added a copy of the C++ wrapper to the repo. You still need to set the parent of that directory as the include directory for the actual API headers, though.
#55
Tested with ChunkWorx:
618 ch/s without GPGPU
678 ch/s with GPGPU
I changed the values of QUEUE_WARNING_LIMIT and QUEUE_SKIP_LIMIT to 50000.
#56
worktycho, I have a feeling you're going about this the wrong way.
Most of the generators do their calculations on arrays of values; each one's calculation is different, but the basic principle is similar: get an array of values, fill it with noise, do some basic calculations on all those values, add another layer of noise, etc., and finally upscale the array to the chunk's size. I think this is the abstraction you should be aiming for. Rather than implementing each generator in separate GPGPU code, provide cValueArray2D and cValueArray3D classes (templates, actually, based on the upscaling factor) that implement those calculations in GPGPU code (lazily - on evaluation, rather than when the operation is called); that primitive can then be used by all the generators, with only minimal code duplication.
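Roughly what I have in mind - interface only, and all the names below are just illustrative:
Code:
// Illustrative sketch, not an actual implementation.
// The template parameter is the upscaling factor; operations are
// only recorded here and executed lazily when the data is needed.
template <int UpscaleFactor>
class cValueArray2D
{
public:
	cValueArray2D(int a_SizeX, int a_SizeY);

	// Each of these just records the operation:
	void FillWithNoise(int a_Seed, float a_Frequency);
	void Add(const cValueArray2D & a_Other);
	void Multiply(float a_Factor);
	void AddConstant(float a_Constant);

	// Upscales the low-res array to the chunk's size and executes
	// all recorded operations (on the GPU, if one is available):
	void Evaluate(float * a_Output, int a_OutSizeX, int a_OutSizeY);
};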
#57
Doing that for the GPU is relatively easy, as it's runtime-compiled. The problem is building the CPU side. For the CPU the biggest problem is memory bandwidth, so the one thing you don't want to do is generate code like this:
Code:
// First pass: writes a[] out to memory.
for(int i = 0; i < chunkSize; i++)
{
     a[i] = b[i] + c[i];
}
// Second pass: reads a[] back in from memory.
for(int i = 0; i < chunkSize; i++)
{
     d[i] = a[i] * e[i];
}
// Third pass: reads d[] back in the same way.
for(int i = 0; i < chunkSize; i++)
{
     f[i] = d[i] - 4;
}
Compilers don't optimise this sort of code, and CPUs hate it: each loop is a separate pass over chunk-sized arrays, so intermediate results keep getting pushed out to memory and read back instead of being fused into a single pass. There are only two portable ways around this that I can think of. One is the monadic template code I was working on, and there isn't a compiler out there that can compile it at the moment - the templates are just too deep. The other is to use abstractions so large that the unfused loops aren't too much of a problem. That means rewriting the algorithms to avoid the basic calculations, or using an interface that allows the basic calculations to be supplied as lambdas and OpenCL strings, like this:
Code:
Array.Map([](int a, int b, int c) { return a + b * c; }, "a + b * c");

I agree, though, that duplication is not the way to go about it. The problem is that, short of implementing a C-to-C compiler or rewriting the generators in a custom DSL, I can't see a portable way of avoiding the duplication.
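To make the lambda-plus-OpenCL-string idea a bit more concrete, here is a rough sketch (the class and everything in it is made up for illustration): the CPU path runs the lambda in one fused pass over the data, while the string would be pasted into kernel source and runtime-compiled for the GPU path.
Code:
#include <cstddef>
#include <string>
#include <vector>

// Sketch only: each element-wise operation is supplied twice,
// once as a lambda (CPU fallback) and once as an OpenCL
// expression string (compiled into a kernel on the GPU path).
class cFloatArray
{
public:
	explicit cFloatArray(std::size_t a_Size) : m_Data(a_Size, 0.0f) {}

	template <typename Fn>
	void Map(Fn a_CpuOp, const std::string & a_ClExpr)
	{
		// CPU path: a single fused pass over the data.
		for (auto & v : m_Data)
		{
			v = a_CpuOp(v);
		}
		// GPU path (not shown): a_ClExpr would be appended to the
		// kernel source and compiled at runtime with clBuildProgram.
		(void)a_ClExpr;
	}

private:
	std::vector<float> m_Data;
};

int main()
{
	cFloatArray Arr(16 * 16);
	Arr.Map([](float v) { return v * 2.0f + 1.0f; }, "v * 2.0f + 1.0f");
	return 0;
}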
#58
Is there a way for the GPU to leave the results in memory so that another operation can use them?
So that when we have calls like
Code:
cValueArray<...> Values(...);
Values.GeneratePerlin(...);
Values.AddConstant(1);
Values.Multiply(2);
Values.Evaluate();
the Evaluate function can use a buffer it uploads to the GPU and then call virtual functions of the operations, such as GeneratePerlin(), that operate on that GPU-side buffer, and finally pull the buffer once all the operations are complete?
Yes, we're still accessing the memory in the wrong pattern, which could be improved upon, but we're not losing our modularity.
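Something like this is what I mean - a sketch only, with a std::vector standing in for the GPU-side buffer and comments marking where the real upload / kernel / download calls would go:
Code:
#include <memory>
#include <vector>

// Sketch: operations are recorded and only executed in Evaluate(),
// so intermediate results never leave the (simulated) GPU buffer.
class cOperation
{
public:
	virtual ~cOperation() {}
	// The real thing would enqueue a kernel on the device-side
	// buffer instead of touching host memory:
	virtual void Apply(std::vector<float> & a_DeviceBuf) = 0;
};

class cAddConstant : public cOperation
{
public:
	explicit cAddConstant(float a_Value) : m_Value(a_Value) {}
	virtual void Apply(std::vector<float> & a_DeviceBuf) override
	{
		for (auto & v : a_DeviceBuf) v += m_Value;
	}
private:
	float m_Value;
};

class cValueArray
{
public:
	explicit cValueArray(std::size_t a_Size) : m_Size(a_Size) {}

	void AddConstant(float a_Value)
	{
		m_Ops.push_back(std::unique_ptr<cOperation>(new cAddConstant(a_Value)));
	}

	void Evaluate(std::vector<float> & a_Out)
	{
		std::vector<float> DeviceBuf(m_Size, 0.0f);  // upload the buffer to the GPU here
		for (auto & op : m_Ops)
		{
			op->Apply(DeviceBuf);  // each op works on the GPU-side buffer
		}
		a_Out = DeviceBuf;  // read the buffer back once, after all ops
	}

private:
	std::size_t m_Size;
	std::vector<std::unique_ptr<cOperation>> m_Ops;
};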
#59
Again, the GPU side is easy: just use a cache object and a callback, and keeping memory on the GPU is also not difficult. GPU-wise I could write an API like that with minimal additional cost. Because you build up an AST for the code generation, you can optimise the memory access pattern, and if the operations are pure the optimisation is easy. The problem is executing on platforms without GPUs, like cheap servers. I experimented with this sort of interface a while back, and it cost up to a 20-30x performance loss on the CPU.
#60
A 20-30x performance loss compared to what, exactly? To the GPU version? That's not a fair comparison.