GPU code

worktycho · 06-05-2014, 08:18 PM

I've done some experimentation and decided that there are four main techniques we can use to write code for the GPU. I just want to get people's opinions before I start on a proper implementation.

1. C++ templates as computation monads.
Pros: Pure C++, all done in the compile.
Cons: Massively complicated types (50-60 lines long is typical). This means that chaining computation monads across function calls, so that the code doesn't go back to main memory, is impossible without C++14.
Monads are slightly confusing. Execution is often not where it seems to be.

2. C++ code as a code-generator (write a set of objects so that doing computation with them results in a program which writes the appropriate code; see the Mill assembler).
Pros: Simple to understand code. Very similar to existing code.
Cons: MSBuild would require two separate compiles every time the generator is modified.

3. Code-generator for a functional language.
Pros: We can keep an executable so MSBuild can do this in one compile. Simple to understand code.
Cons: It's a new language.

4. Separate Implementations
Pros: Easy to set up, just reimplement the existing generator in OpenCL/OpenGL.
Cons: Every change has to be made in triplicate.

So which option should we go for?

xoft · 06-05-2014, 08:27 PM

Since I haven't done much along these lines, I don't really know what to imagine under each solution. Is there an easy-to-grasp example that you could provide for each of the proposed methods?

worktycho · (This post was last modified: 06-05-2014, 08:42 PM by worktycho.)

Yes:

Add two vectors and multiply by a third.

Method 1:

Code:
Vector<int, 16> a, b, c;

auto temp = VecMath::Add(a,b);

Vector<int, 16> result = VecMath::Multiply(temp, c).Evaluate();

Note that no actual computation is done until the .Evaluate() method is called.
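To make the deferred evaluation concrete, here is a minimal sketch of how Method 1 could be built with expression templates. The `Vector`, `VecMath::Add`, `VecMath::Multiply` and `Evaluate` names follow the example above; `BinaryExpr`, `AddOp` and `MulOp` are hypothetical helpers invented for this illustration, and the sketch runs on the CPU only. It is not the actual prototype.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for the Vector type used in the example above.
template <typename T, std::size_t N>
struct Vector {
    using value_type = T;
    static constexpr std::size_t size = N;
    std::array<T, N> data{};
    T operator[](std::size_t i) const { return data[i]; }
};

// Element-wise operations (names are assumptions for this sketch).
struct AddOp { template <typename T> static T apply(T a, T b) { return a + b; } };
struct MulOp { template <typename T> static T apply(T a, T b) { return a * b; } };

// A lazy node: it only remembers its operands. Nothing is computed until an
// element is requested, which happens inside Evaluate().
// Operands are stored by value to keep the sketch free of dangling references.
template <typename L, typename R, typename Op>
struct BinaryExpr {
    using value_type = typename L::value_type;
    static constexpr std::size_t size = L::size;
    L lhs;
    R rhs;
    value_type operator[](std::size_t i) const { return Op::apply(lhs[i], rhs[i]); }

    // Walks the whole expression tree once per element, in a single loop.
    Vector<value_type, size> Evaluate() const {
        Vector<value_type, size> out;
        for (std::size_t i = 0; i < size; ++i) out.data[i] = (*this)[i];
        return out;
    }
};

namespace VecMath {
template <typename L, typename R>
BinaryExpr<L, R, AddOp> Add(const L& l, const R& r) { return {l, r}; }

template <typename L, typename R>
BinaryExpr<L, R, MulOp> Multiply(const L& l, const R& r) { return {l, r}; }
}  // namespace VecMath
```

Here the type of `temp` in the example would be `BinaryExpr<Vector<int, 16>, Vector<int, 16>, AddOp>`, and each further operation nests one more level, which is where the very long type names come from.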

Method 2:

Code:
Vector<int, 16> a, b, c;

Vector<int, 16> result = (a + b) * c;

In this case this code would compile to a program that generates the actual implementation.
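As a rough illustration of the generator idea, the sketch below overloads the operators on a type that carries source text instead of values, so that evaluating `(a + b) * c` produces a string of target code. `GenVec` and `EmitAssignment` are hypothetical names made up for this sketch; a real generator would also have to emit declarations, kernels and so on.

```cpp
#include <string>

// Hypothetical generator-side vector: holds the source text of an expression,
// not any numeric data.
struct GenVec {
    std::string expr;
};

// Each operator builds the text of the corresponding target-language expression.
inline GenVec operator+(const GenVec& l, const GenVec& r) {
    return GenVec{"(" + l.expr + " + " + r.expr + ")"};
}

inline GenVec operator*(const GenVec& l, const GenVec& r) {
    return GenVec{"(" + l.expr + " * " + r.expr + ")"};
}

// Emits one assignment statement in the target language.
inline std::string EmitAssignment(const std::string& target, const GenVec& value) {
    return target + " = " + value.expr + ";";
}
```

Running the generator is the first compile-and-run step; compiling the emitted text is the second, which is the double compile mentioned in the cons.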

Method 3:

Code:
a :: Vector Int 16

b :: Vector Int 16

c :: Vector Int 16

result = (a + b) * c

This would then be compiled to the code required.

Method 4:

C++

Code:
int a[16], b[16], c[16];

int result[16];

for (int i = 0; i < 16; i++) result[i] = (a[i] + b[i]) * c[i];

OpenCL

Code:
int16 a, b, c;

int16 result;

result = (a + b) * c;

OpenGL ES

Code:
int16 a, b, c;

int16 result;

result = (a + b) * c;

OpenCL and OpenGL ES have significant differences not shown here, so they would need separate implementations.

worktycho · 06-06-2014, 02:54 AM

My personal preference is option two. However, its main disadvantage applies only to Windows and could have a big impact on Windows developers, so I thought I should post the pros and cons of the various methods. If no one says anything, I'll go for option two.

tigerw · 06-06-2014, 03:03 AM

Eh, two compiles isn't that bad for me, as long as everything is automatic.

xoft · 06-06-2014, 03:25 AM

I have to say my favorite, at least from these examples, is option 1. It does represent the code pretty well and it doesn't require an extra step (which is always painful). To give you an example why we don't want double-compile, let me just say two words: tolua + android.

Does it really require C++14? Isn't C++11 enough? If this code takes a bit longer to develop, we might get C++11-ready in the meantime, but C++14 is too far out of reach.

worktycho · (This post was last modified: 06-06-2014, 03:42 AM by worktycho.)

I can do it with C++11 except for one thing: returning partially computed values from multi-statement functions. This is important because you don't want to transfer data to and from the GPU if you can prevent it. C++14 adds automatic return type deduction, which would enable functions to return these types without having to specify them in full. The MSVC 2013 Nov CTP actually supports this feature, so it might not be too much of an issue.

Whilst on Linux option 2 is my favourite, I recognise that on Windows option 1 is probably better, although there are disadvantages. That's why I prototyped option 1 and came across the problem with returns.

xoft · 06-06-2014, 05:33 AM

The question is, do we even need to return the partially computed values?

worktycho · 06-06-2014, 05:49 AM

Yes, if you want to have functions of a reasonable size. Take the example of the lighting code, PrepareBlockLight and CalcLight. PrepareBlockLight sets up the initial seeds, which CalcLight diffuses. Without partial computation the BlockLight values have to be written to memory, and that flushes 2/3 of the L1 cache on a Pi, along with being ridiculously expensive on a GPU. Whereas if we can return the monad, the computations can be pipelined. Another option is to just keep to single-expression functions and use:

Code:
#define RETURN(x) -> decltype(x) { return x; }
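A small sketch of how that macro would be used under C++11, next to the C++14 form that allows several statements. `AddThenDouble` and `AddThenDoubleMulti` are hypothetical function names chosen for this illustration, and plain arithmetic stands in for the real computation types.

```cpp
// Same macro as above: writes the expression once and reuses it for both
// the trailing return type and the function body.
#define RETURN(x) -> decltype(x) { return x; }

// C++11: works, but the whole function must be a single expression.
template <typename A, typename B>
auto AddThenDouble(const A& a, const B& b) RETURN((a + b) * 2)

#if __cplusplus >= 201402L
// C++14: the return type is deduced from the return statement, so the body
// can hold intermediate values without spelling out their types.
template <typename A, typename B>
auto AddThenDoubleMulti(const A& a, const B& b) {
    auto sum = a + b;  // with lazy types this would be an unevaluated expression
    return sum * 2;
}
#endif
```

With lazy expression types, both functions would hand an unevaluated expression back to the caller, so the caller decides when the work actually happens.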

xoft · (This post was last modified: 06-09-2014, 05:40 AM by xoft.)

From a somewhat different perspective: I'd like to refactor most of the terrain generating code, because currently it's pulling the noise values somewhat slowly. I figured I could make a class that would hold the 2D or 3D array of noise values, define operations on that object first (such as "generate perlin", "clamp to", "add a constant to all", and "add another array"), then let it calculate all the values and finally use them. My point here is that this would allow us to refine the upscaling - the arrays could calculate all the operations in a smaller scale and then linearly upscale only the results. Quite a lot of generators take a noise, upscale it, and then do some math with it before using it.
Will such a class play nicely with your GPU optimizations? Could it perhaps act as the wrapper for various calculating backends (your method 4)?

Pseudocode example:

cValueArray2D<4, 4> Values(17, 17);  // 2D chunk-sized value array with 4x upscaling in each direction
Values.GeneratePerlin(...);
cValueArray2D<4, 4> Multiplier(17, 17);
Multiplier.GeneratePerlin(...);
Values.Multiply(Multiplier);
Values.Clamp(0, 1);
Values.MultiplyByConstant(80);
Values.AddConstant(40);
Values.Calculate();  // This does all the actual calculations
// Values are calculated in a 4x smaller version first (5*5 numbers),
// and the final results are then linearly upscaled to the full dimension (17*17 numbers)
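One possible shape for such a class, sketched in C++ under several assumptions: `DeferredArray2D` and `Generate` are hypothetical names, a caller-supplied function stands in for the Perlin generator, only per-element operations are queued (combining two arrays is left out), and `Calculate` returns the values instead of storing them. It runs on the CPU; a GPU backend would replace the body of `Calculate`.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Operations are queued first, run on a coarse grid in Calculate(), and only
// the final result is linearly upscaled to the full size.
template <std::size_t ScaleX, std::size_t ScaleY>
class DeferredArray2D {
public:
    DeferredArray2D(std::size_t w, std::size_t h)
        : m_W(w), m_H(h), m_CW((w - 1) / ScaleX + 1), m_CH((h - 1) / ScaleY + 1) {}

    // Stand-in for GeneratePerlin: the generator is sampled at coarse points only.
    void Generate(std::function<float(std::size_t, std::size_t)> gen) { m_Gen = std::move(gen); }
    void MultiplyByConstant(float k) { m_Ops.push_back([k](float v) { return v * k; }); }
    void AddConstant(float k) { m_Ops.push_back([k](float v) { return v + k; }); }
    void Clamp(float lo, float hi) {
        m_Ops.push_back([lo, hi](float v) { return std::min(std::max(v, lo), hi); });
    }

    // Does all the actual work: coarse pass, then bilinear upscaling.
    std::vector<float> Calculate() const {
        std::vector<float> coarse(m_CW * m_CH);
        for (std::size_t cy = 0; cy < m_CH; ++cy)
            for (std::size_t cx = 0; cx < m_CW; ++cx) {
                float v = m_Gen ? m_Gen(cx * ScaleX, cy * ScaleY) : 0.0f;
                for (const auto& op : m_Ops) v = op(v);
                coarse[cy * m_CW + cx] = v;
            }
        std::vector<float> full(m_W * m_H);
        for (std::size_t y = 0; y < m_H; ++y)
            for (std::size_t x = 0; x < m_W; ++x) {
                const std::size_t x0 = x / ScaleX, x1 = std::min(x0 + 1, m_CW - 1);
                const std::size_t y0 = y / ScaleY, y1 = std::min(y0 + 1, m_CH - 1);
                const float fx = static_cast<float>(x % ScaleX) / ScaleX;
                const float fy = static_cast<float>(y % ScaleY) / ScaleY;
                const float top = coarse[y0 * m_CW + x0] * (1 - fx) + coarse[y0 * m_CW + x1] * fx;
                const float bot = coarse[y1 * m_CW + x0] * (1 - fx) + coarse[y1 * m_CW + x1] * fx;
                full[y * m_W + x] = top * (1 - fy) + bot * fy;
            }
        return full;
    }

private:
    std::size_t m_W, m_H, m_CW, m_CH;
    std::function<float(std::size_t, std::size_t)> m_Gen;
    std::vector<std::function<float(float)>> m_Ops;
};
```

For a 17x17 array with 4x scaling this evaluates the queued operations on 5*5 points and interpolates the rest, matching the comment in the pseudocode. Because the operations are applied before upscaling, a nonlinear step such as `Clamp` can give slightly different results than clamping at full resolution.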