Brainstorming: Noise optimization

xoft · (This post was last modified: 03-28-2013, 06:27 AM by xoft.)

As I've already sketched out in FS #337 (http://www.mc-server.org/support/index.p...ask_id=337), the noise generator could be optimized by generating an array of values at once instead of a single value. I'd like to give it a whack.

At the same time, I don't think the code in Noise.cpp that is using SSE is any good - it hasn't even been in use for as long as I've been on this project. So how about I get rid of it in favor of the new array-handling noise?

As for the arrays, I was thinking about creating several new classes. cCubicNoise would generate the same noise that cNoise now does, but for arrays. Then cPerlinNoise would combine several cCubicNoise-s to produce a Perlin noise. If found useful, a cRidgedMulti class would be written to combine two cPerlinNoise-s to produce a ridged multifractal noise.
Each of the noise classes will have functions Generate1D, Generate2D and Generate3D that will take 1D, 2D and 3D arrays of doubles, and coords for the array boundaries in the noise-space.
Because cPerlinNoise and cRidgedMulti need an extra array as workspace, I'm thinking about letting callers provide this workspace array as an optional parameter, so that each call to GenerateND() doesn't allocate and free memory. Callers will usually be able to cache and reuse these workspaces.

So this will be the main interface:

Code:
class cCubicNoise
{
public:
    void Generate1D(
        double * a_Array,                ///< Array to generate into
        int a_SizeX,                     ///< Size of the array (num doubles)
        double a_StartX, double a_EndX,  ///< Noise-space coords of the array
        double * a_Workspace = NULL      ///< Workspace that this function can use and trash, same size as a_Array
    );

    void Generate2D(
        double * a_Array,                ///< Array to generate into [x + a_SizeX * y]
        int a_SizeX, int a_SizeY,        ///< Size of the array (num doubles), in each direction
        double a_StartX, double a_EndX,  ///< Noise-space coords of the array in the X direction
        double a_StartY, double a_EndY,  ///< Noise-space coords of the array in the Y direction
        double * a_Workspace = NULL      ///< Workspace that this function can use and trash, same size as a_Array
    );

    void Generate3D(
        double * a_Array,                       ///< Array to generate into [x + a_SizeX * y + a_SizeX * a_SizeY * z]
        int a_SizeX, int a_SizeY, int a_SizeZ,  ///< Size of the array (num doubles), in each direction
        double a_StartX, double a_EndX,         ///< Noise-space coords of the array in the X direction
        double a_StartY, double a_EndY,         ///< Noise-space coords of the array in the Y direction
        double a_StartZ, double a_EndZ,         ///< Noise-space coords of the array in the Z direction
        double * a_Workspace = NULL             ///< Workspace that this function can use and trash, same size as a_Array
    );
} ;

// Same interface for the other noise classes.
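
To illustrate the optional workspace parameter, here is a minimal sketch, not the planned implementation. GeneratePerlin1DSketch and FillOctaveSketch are hypothetical names. The octave generator is a placeholder sine wave standing in for a real cCubicNoise call, and doubling the frequency while halving the amplitude per octave is only an assumption about how the octaves might be combined.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical placeholder for one octave (a sine wave, NOT real noise).
// A real implementation would call something like cCubicNoise::Generate1D here.
static void FillOctaveSketch(double * a_Array, int a_SizeX, double a_StartX, double a_EndX)
{
    for (int i = 0; i < a_SizeX; i++)
    {
        double t = (a_SizeX > 1) ? static_cast<double>(i) / (a_SizeX - 1) : 0.0;
        a_Array[i] = std::sin(a_StartX + (a_EndX - a_StartX) * t);
    }
}

// Sums several octaves into a_Array. If the caller passes no workspace,
// a temporary one is allocated for this call only.
static void GeneratePerlin1DSketch(
    double * a_Array, int a_SizeX,
    double a_StartX, double a_EndX,
    int a_NumOctaves,
    double * a_Workspace = nullptr)
{
    std::vector<double> Fallback;
    if (a_Workspace == nullptr)
    {
        Fallback.resize(static_cast<std::size_t>(a_SizeX));
        a_Workspace = Fallback.data();
    }
    std::fill(a_Array, a_Array + a_SizeX, 0.0);
    double Frequency = 1.0;  // assumed: doubles per octave
    double Amplitude = 1.0;  // assumed: halves per octave
    for (int Octave = 0; Octave < a_NumOctaves; Octave++)
    {
        // The workspace is overwritten ("trashed") by every octave.
        FillOctaveSketch(a_Workspace, a_SizeX, a_StartX * Frequency, a_EndX * Frequency);
        for (int i = 0; i < a_SizeX; i++)
        {
            a_Array[i] += a_Workspace[i] * Amplitude;
        }
        Frequency *= 2.0;
        Amplitude *= 0.5;
    }
}
```

A caller that generates many arrays of the same size could keep one std::vector<double> around and pass its data() on every call, so the allocation happens only once.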

Does anyone have any thoughts about this? I'm especially interested in any reasons for keeping the SSE code in.

bearbin · 03-28-2013, 07:02 AM

Seems quite good, especially if it will make performance improvements.

xoft · 03-28-2013, 07:29 AM

I believe it will make quite a substantial performance improvement.
The current noise has to do all of this for each point queried:
1, Floor all coords to integral values
2, Calculate underlying noise value at 4x4x4 integral neighbors
3, Cubic-interpolate each layer (4x4x4 -> 4x4), then cubic-interpolate each column (4x4 -> 4), then finally cubic-interpolate the final value (4 -> 1)

With the new system, points 1 and 2 will be done only occasionally (rough expectation: about 5 % of the time), and I think even the interpolation could be tweaked somehow to save a few operations.
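
The saving can be sketched in one dimension. This is only an illustration under assumptions: all names ending in Sketch are hypothetical, the lattice hash and the cubic formula are stand-ins rather than the actual cNoise code, and the samples are assumed to be spread evenly with both endpoints included. This sketch still floors every sample and only skips the neighbor lookups (step 2) while the samples stay in the same integer cell.

```cpp
#include <cmath>

// Hypothetical lattice noise in [-1, 1]; stands in for the real per-integer noise.
// Unsigned arithmetic is used so that overflow wraps instead of being undefined.
static double LatticeNoiseSketch(int a_X)
{
    unsigned int x = static_cast<unsigned int>(a_X);
    x = (x << 13) ^ x;
    unsigned int h = (x * (x * x * 15731u + 789221u) + 1376312589u) & 0x7fffffffu;
    return 1.0 - static_cast<double>(h) / 1073741824.0;
}

// Cubic interpolation through four neighbors; returns b at t = 0 and c at t = 1.
// This particular formula is an assumption, not necessarily what cNoise uses.
static double CubicInterpolateSketch(double a, double b, double c, double d, double t)
{
    double P = (d - c) - (a - b);
    double Q = (a - b) - P;
    double R = c - a;
    return ((P * t + Q) * t + R) * t + b;
}

// Per-point version: floor, neighbor lookup and interpolation for every query.
static double NoiseAtSketch(double a_X)
{
    int Base = static_cast<int>(std::floor(a_X));
    return CubicInterpolateSketch(
        LatticeNoiseSketch(Base - 1), LatticeNoiseSketch(Base),
        LatticeNoiseSketch(Base + 1), LatticeNoiseSketch(Base + 2),
        a_X - Base);
}

// Array version: the four neighbors are refreshed only when the integer cell changes.
// Returns the number of refreshes so the saving can be observed.
static int Generate1DSketch(double * a_Array, int a_SizeX, double a_StartX, double a_EndX)
{
    int NumRefreshes = 0;
    int CachedBase = 0;
    double N[4] = {0, 0, 0, 0};
    for (int i = 0; i < a_SizeX; i++)
    {
        // Assumption: evenly spaced samples, both endpoints included.
        double x = (a_SizeX > 1) ? a_StartX + (a_EndX - a_StartX) * i / (a_SizeX - 1) : a_StartX;
        int Base = static_cast<int>(std::floor(x));
        if ((NumRefreshes == 0) || (Base != CachedBase))
        {
            for (int j = 0; j < 4; j++)
            {
                N[j] = LatticeNoiseSketch(Base - 1 + j);
            }
            CachedBase = Base;
            NumRefreshes++;
        }
        a_Array[i] = CubicInterpolateSketch(N[0], N[1], N[2], N[3], x - Base);
    }
    return NumRefreshes;
}
```

For 64 samples spanning noise-space 0 to 4, Generate1DSketch refreshes the neighbors at most five times, whereas calling NoiseAtSketch per sample would look them up 64 times.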

ThuGie · 01-15-2014, 10:29 AM

Hey,

Just wondering whether this might be useful:
http://en.wikipedia.org/wiki/Simplex_noise

xoft · 01-15-2014, 10:56 PM

There's not much info on the noise generation itself.

Anyway, MCS already uses as little noise as possible, so speeding it up won't matter too much.

worktycho · (This post was last modified: 01-16-2014, 03:28 AM by worktycho.)

The only thing about noise is that it's extremely well suited to vectorization. However, it isn't vectorized automatically, because the noise generation spans several functions, and combining several operations is less common than loop vectorization. I tried some experiments with the clang and gcc vector extensions, but they did not seem to generate SSE instructions (other than scalar floating point). If you're looking at parallelizing, it might be worthwhile to use macros to rewrite the code to use SSE/AVX or NEON where available, but that would reduce readability. For example, SSE2, which is in all x64 machines, can perform 4 single-precision calculations simultaneously, so it might be worth thinking about generating vectors rather than arrays.

FakeTruth · 01-16-2014, 03:48 AM

I tried using SSE for the noise once. It gave the same results, but it was not faster at all.
It required so many functions to get around the "***intrin" functions that it didn't pay off at all. The tiny function to generate a pseudo-random number became huge.

worktycho · 01-16-2014, 04:12 AM

What sort of functions: conversions, or handling non-SSE platforms? If we're generating 4 elements at a time, it seems obvious to use a vector add rather than 4 separate adds. Also, if we're doing parallel generation, then we can just keep the existing code but make it work on vectors instead.

FakeTruth · 01-16-2014, 05:53 AM

This function

	float cNoise::IntNoise( int a_X )
	{
		int x = ((a_X*m_Seed)<<13) ^ a_X;
		return ( 1.0f - ( (x * (x * x * 15731 + 789221) + 1376312589) & 0x7fffffff) / 1073741824.0f); 
	}

turned into this monster

__m128 SSE_IntNoise( const __m128i & a_X4 )
{
	__m128i X4 = _mm_xor_si128( _mm_slli_epi32( a_X4, 13 ), a_X4 );

	//_mm_sub_ps( _mm_set_ps1( 1.0f ) // 1.f -
	
	__m128 result = _mm_sub_ps( 
		  _mm_set_ps1( 1.0f )
		, _mm_div_ps( // ( ( (x * ((x*x)*15731 + 789221)) + 1376312589 ) & 0x7fffffff ) / 1073741824.0f
			_mm_cvtepi32_ps( // (float) -> converts to float
				_mm_and_si128( // ( (x * ((x*x)*15731 + 789221)) + 1376312589 ) & 0x7fffffff
					  _mm_set1_epi32( 0x7fffffff ) // 0x7fffffff
					, _mm_add_epi32( // (x * ((x*x)*15731 + 789221)) + 1376312589
						  _mm_set1_epi32( 1376312589 ) // 1376312589
						, _mm_mul_epu32( // x * ((x*x)*15731 + 789221)
							 X4
						   , _mm_add_epi32( // ((x*x)*15731 + 789221)
							   _mm_set1_epi32( 789221 ) // 789221
							 , _mm_mul_epu32( // ((x*x)*15731)
								  _mm_mul_epu32( X4, X4 ) // x*x
								, _mm_set1_epi32( 15731 ) // 15731
								)
							)
						)
					)
				)
			)
			, _mm_set_ps1( 1073741824.0f ) // 1073741824.0f
		)
	);

	return result;
}

xoft · 01-16-2014, 06:18 AM

I think this was the wrong approach: we were getting a value for a single coordinate. Now we have the opportunity to optimize in a truly vector fashion: with the cCubicNoise class, each of the Generate() functions operates on an entire array of neighboring noise values. I believe that *could* be optimized with the vector instructions, but I don't have the guts to do it properly.