Posts: 783
Threads: 12
Joined: Jan 2014
Thanks: 2
Given 73 thank(s) in 61 post(s)
01-19-2014, 07:16 AM
(This post was last modified: 01-19-2014, 07:18 AM by worktycho.)
Because you're looking at 4-5 variants to cover standards-compliant C++, SSE2, AVX, AVX2 and NEON. Also, the vector-specific code takes up 70-80% of the functions I'm looking at optimising, so it would be far cleaner just to have 5 functions, but then keeping them in sync is a pain.
Also, with the path I'm going down at the moment it would be easy to add AVX-512 support when Intel releases Skylake in 2015. It would also be easier to add support for any strange platforms that someone wants to use (GPUs, PowerPC in an old Mac, MIPS on a PS3, Itanium, Mill if it gets released).
Posts: 6,485
Threads: 176
Joined: Jan 2012
Thanks: 131
Given 1075 thank(s) in 852 post(s)
The real question is, why doesn't the C++ compiler optimize our (single) C++ code into those instruction sets? I'd very much prefer clean C++ code that compiles to slightly-less-than-ideal machine code over squeezing out some 5% of performance through a huge code overhaul like this.
I know even VS2008 supports SSE2 instructions (at least it has a switch for them in the project properties UI), so it should be possible to have it output the optimized code somehow. It's more likely that we're not doing something right if our code doesn't get optimized.
Posts: 783
Threads: 12
Joined: Jan 2014
Thanks: 2
Given 73 thank(s) in 61 post(s)
01-19-2014, 07:40 AM
(This post was last modified: 01-19-2014, 07:47 AM by worktycho.)
It doesn't optimize because the compiler needs to prove that the loop executes a multiple of the vector width times. I've read the assembly and it's not vectorizing it. There is also a problem with floating-point reordering: it changes the precision, which prevents vectorisation even with -ffast-math. I think the main problem is that a number of functions span translation units, so the arrays have to be converted back to arrays of doubles for the call. That is the classic C++ problem of not being able to do cross-translation-unit optimizations.
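To make the reordering point concrete, here is a minimal sketch, assuming a plain summation loop (both functions are hypothetical and not taken from the generator). The strict version adds in source order; the second spells out the four-lane reordering a vectoriser would need, including the scalar tail for counts that are not a multiple of the lane count.

```cpp
#include <cstddef>

// Strict left-to-right sum: each add depends on the previous one, so the
// compiler may only vectorise this if it is allowed to reassociate
// floating-point additions.
double SumStrict(const double * a_Values, std::size_t a_Count)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < a_Count; i++)
    {
        sum += a_Values[i];
    }
    return sum;
}

// Four independent partial sums: the reordering is written out by hand, so
// the result may differ from SumStrict in the last bits for general input.
double SumFourLanes(const double * a_Values, std::size_t a_Count)
{
    double lanes[4] = {0.0, 0.0, 0.0, 0.0};
    std::size_t i = 0;
    for (; i + 4 <= a_Count; i += 4)
    {
        for (std::size_t lane = 0; lane < 4; lane++)
        {
            lanes[lane] += a_Values[i + lane];
        }
    }
    double sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    // Scalar tail for counts that are not a multiple of 4
    for (; i < a_Count; i++)
    {
        sum += a_Values[i];
    }
    return sum;
}
```

For small integer-valued inputs both functions agree exactly; the difference only shows up once rounding comes into play.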
Posts: 1,450
Threads: 53
Joined: Feb 2011
Thanks: 15
Given 120 thank(s) in 91 post(s)
At least NEON looks very similar to SSE. You could simply #define these functions and it should work.
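A rough sketch of what that #define approach could look like, assuming only four-float load, store, add and multiply are needed. The VEC_* names, the Vec4f typedef and MulAdd are made up for illustration; the intrinsics are the standard SSE and NEON ones, and the scalar branch is only there so the code still builds elsewhere.

```cpp
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
    #include <emmintrin.h>
    typedef __m128 Vec4f;
    #define VEC_LOAD(ptr)      _mm_loadu_ps(ptr)
    #define VEC_STORE(ptr, v)  _mm_storeu_ps((ptr), (v))
    #define VEC_ADD(a, b)      _mm_add_ps((a), (b))
    #define VEC_MUL(a, b)      _mm_mul_ps((a), (b))
#elif defined(__ARM_NEON)
    #include <arm_neon.h>
    typedef float32x4_t Vec4f;
    #define VEC_LOAD(ptr)      vld1q_f32(ptr)
    #define VEC_STORE(ptr, v)  vst1q_f32((ptr), (v))
    #define VEC_ADD(a, b)      vaddq_f32((a), (b))
    #define VEC_MUL(a, b)      vmulq_f32((a), (b))
#else
    // Scalar fallback (hypothetical helpers) for platforms with neither set
    struct Vec4f { float m[4]; };
    inline Vec4f ScalarLoad(const float * p) { Vec4f r; for (int i = 0; i < 4; i++) { r.m[i] = p[i]; } return r; }
    inline void ScalarStore(float * p, Vec4f v) { for (int i = 0; i < 4; i++) { p[i] = v.m[i]; } }
    inline Vec4f ScalarAdd(Vec4f a, Vec4f b) { for (int i = 0; i < 4; i++) { a.m[i] += b.m[i]; } return a; }
    inline Vec4f ScalarMul(Vec4f a, Vec4f b) { for (int i = 0; i < 4; i++) { a.m[i] *= b.m[i]; } return a; }
    #define VEC_LOAD(ptr)      ScalarLoad(ptr)
    #define VEC_STORE(ptr, v)  ScalarStore((ptr), (v))
    #define VEC_ADD(a, b)      ScalarAdd((a), (b))
    #define VEC_MUL(a, b)      ScalarMul((a), (b))
#endif

// a_Dst[i] = a_A[i] * a_B[i] + a_C[i]; a_Count must be a multiple of 4
void MulAdd(float * a_Dst, const float * a_A, const float * a_B, const float * a_C, std::size_t a_Count)
{
    for (std::size_t i = 0; i < a_Count; i += 4)
    {
        Vec4f product = VEC_MUL(VEC_LOAD(a_A + i), VEC_LOAD(a_B + i));
        VEC_STORE(a_Dst + i, VEC_ADD(product, VEC_LOAD(a_C + i)));
    }
}
```

This only covers the operations where the two instruction sets line up one-to-one; shuffles and horizontal operations differ more and would need real wrapper functions rather than macros.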
Posts: 1,450
Threads: 53
Joined: Feb 2011
Thanks: 15
Given 120 thank(s) in 91 post(s)
Exactly. I would also say try optimizing it before writing some sort of fancy parser; the optimization might not work at all, and then you have done all this work for nothing. I seriously think you are overcomplicating things.
Posts: 783
Threads: 12
Joined: Jan 2014
Thanks: 2
Given 73 thank(s) in 61 post(s)
Yes, I'm working on the comparison tool at the moment. It's proving a lot more work than expected because the generator is quite tightly coupled indirectly.
Posts: 6,485
Threads: 176
Joined: Jan 2012
Thanks: 131
Given 1075 thank(s) in 852 post(s)
I had a Eureka moment last night. If I remember correctly, it is possible to templatize functions not only on type, but also on an integer value, right? This might make the cCubicNoise::GenerateX() functions faster: if they had the array dimensions as template parameters, rather than regular parameters, the compiler would be able to make many more optimizations, such as unrolling the loop and perhaps even vectorizing the calculation, if it's smart enough. We're using a small set of dimensions anyway, usually only a 5x5 array.
It might be interesting to implement at least a 2D generator in this way and compare its performance with the current approach, whether we get any substantial difference on, let's say, a million 5x5 arrays generated. The NoiseTest project should be able to measure this, when modified.
Another set of functions quite likely to benefit from such an optimization would be the linear upscaling functions ($/src/LinearUpscale.h), used heavily in the generator. Again, a test confirming this would be nice, and it doesn't require the tiresome decoupling.
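A minimal sketch of the idea, with the caveat that none of this is the real cCubicNoise code: SampleAt is a stand-in for the per-point noise evaluation, and both Generate2D functions are hypothetical. The only difference between them is whether the dimensions arrive as regular parameters or as template parameters.

```cpp
// Hypothetical stand-in for the per-point noise evaluation
inline float SampleAt(float a_X, float a_Y)
{
    return a_X * 0.5f + a_Y * 0.25f;
}

// Runtime-sized version: the loop bounds are only known at the call
void Generate2DRuntime(float * a_Out, int a_SizeX, int a_SizeY, float a_StartX, float a_StartY, float a_Step)
{
    for (int y = 0; y < a_SizeY; y++)
    {
        for (int x = 0; x < a_SizeX; x++)
        {
            a_Out[x + a_SizeX * y] = SampleAt(a_StartX + x * a_Step, a_StartY + y * a_Step);
        }
    }
}

// Compile-time-sized version: SizeX and SizeY are constants inside the body,
// so the optimiser knows the exact trip counts and is free to unroll
template <int SizeX, int SizeY>
void Generate2D(float * a_Out, float a_StartX, float a_StartY, float a_Step)
{
    for (int y = 0; y < SizeY; y++)
    {
        for (int x = 0; x < SizeX; x++)
        {
            a_Out[x + SizeX * y] = SampleAt(a_StartX + x * a_Step, a_StartY + y * a_Step);
        }
    }
}
```

A call such as Generate2D<5, 5>(buffer, 0.0f, 0.0f, 1.0f) instantiates one copy per distinct size, so this pays off mainly when the set of sizes is small. Whether it is actually faster would still have to be measured.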
Posts: 783
Threads: 12
Joined: Jan 2014
Thanks: 2
Given 73 thank(s) in 61 post(s)
01-26-2014, 12:15 AM
(This post was last modified: 01-26-2014, 12:18 AM by worktycho.)
That sounds like an idea. Also, if the compilers aren't smart enough to vectorize and unroll, you can implement specialized templates for common values.
The only slight issue is that it means all the functions have to be inline, but whether that's an advantage (cross-method optimizations like inlining and calling convention modification) or a disadvantage (increases binary size) I'm not sure.
I'll keep working on decoupling, as in the worst case it's very useful groundwork for a chunk server.
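For illustration, specializing for a common value could look roughly like this. DotRow is a hypothetical helper, not something from the codebase. Note that a full specialization of a function template is an ordinary function, so it needs inline if it is defined in a header.

```cpp
// Generic version: works for any compile-time width
template <int Width>
inline float DotRow(const float * a_A, const float * a_B)
{
    float sum = 0.0f;
    for (int i = 0; i < Width; i++)
    {
        sum += a_A[i] * a_B[i];
    }
    return sum;
}

// Hand-written specialization for the common Width == 4 case.
// The pairwise grouping changes the order of the additions, so rounding
// may differ slightly from the generic loop.
template <>
inline float DotRow<4>(const float * a_A, const float * a_B)
{
    return (a_A[0] * a_B[0] + a_A[1] * a_B[1]) + (a_A[2] * a_B[2] + a_A[3] * a_B[3]);
}
```

Callers write DotRow<4>(a, b) or DotRow<5>(a, b) the same way; the compiler picks the specialization when one exists. The body of the specialization is also where intrinsics could go.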
Posts: 783
Threads: 12
Joined: Jan 2014
Thanks: 2
Given 73 thank(s) in 61 post(s)
Just discovered a new tool that should make it easier to write C++ loops that can be vectorised. GCC will tell you why it isn't vectorizing each loop if you pass it the -ftree-vectorizer-verbose=7 flag. Writing it down so I don't forget about it.
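A small loop to try the report on, as a sketch (the function and the file name scale.cpp are made up). Newer GCC releases use the -fopt-info-vec family of options for the same purpose.

```cpp
#include <cstddef>

// Example build commands (file name is hypothetical):
//   g++ -O3 -ftree-vectorizer-verbose=7 -c scale.cpp
//   g++ -O3 -fopt-info-vec-missed -c scale.cpp    (newer GCC releases)
//
// a_Dst and a_Src are plain pointers, so the compiler cannot assume they
// do not overlap; the report shows how it deals with that.
void ScaleAndOffset(float * a_Dst, const float * a_Src, std::size_t a_Count, float a_Scale, float a_Offset)
{
    for (std::size_t i = 0; i < a_Count; i++)
    {
        a_Dst[i] = a_Src[i] * a_Scale + a_Offset;
    }
}
```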