The treachery of optimization

xoft · (This post was last modified: 02-28-2017, 07:05 AM by xoft.)

I'd like to share an experience I just had while trying to optimize the PieceStructures generator.

As you may know from my presentation at last year's meetup, the PieceStructures generator works by building a "tree" of individual pieces, connected together by connectors. Each piece may have multiple connectors, each connector has a type and orientation, only the same type of connectors are matched together in the resulting structure. So for example the Nether fortress has connections of two types - type "0" for the outside connections, and type "1" / "-1" for the inside connections.

The other day I've noticed that the NetherFort generator takes a rather long time to create one single NetherFort instance - it took a full second (in Debug mode, but still) just to put together some pieces. Since I'm writing a related generator which will most likely be even more performance-tight, I decided to investigate this.

After running the generator in a profiler, I noticed that the function taking the most time is cPieceGenerator::CheckConnection() - not because it does a lot of work, but because it is called lots of times. And I mean lots - about 500 000 times for a single NetherFort. So naturally, I tried to think of an optimization here. What the function does: when a new piece is to be added to the "tree", it needs to be checked whether it collides with the already-placed pieces. This collision-check is done for each candidate that is considered; only those that pass this test are then put into the final random-based decision. So this function checks whether a piece at a specific position and orientation collides with any of the already placed pieces. One natural observation: most of the time, the "tree" of pieces is growing outwards from its starting piece, so it would make sense to have a hitbox that encapsulates all the already placed pieces, check against that hitbox first and if it doesn't collide, then it's clear that none of the placed pieces can collide either. So this big hitbox test should filter away many of the repeated tests with each placed piece.

So I coded this optimization and re-run the profiler. Imagine my surprise when the performance actually halved! Yes, halved, the generator now ran twice as slow, my testcase took 21 seconds instead of 11 seconds! I didn't believe what I was seeing, at first I thought I used different settings for the two perf analyses. A quick stash and revert of the code showed that the parameters were the same and indeed the code is that much slower.

This piqued my interest even more. I suspected that the reason for the perf drop was passing too many parameters - the function already had 6 parameters, I added another one for the overall hitbox, so it could just have made the difference that the params wouldn't fit into registers. So I moved the overall hitbox from the parameters into the class. Not much changed, I got from 21 seconds down to 20 seconds.

Then I had another look at the perf analysis output and it hit me - the cCuboid:: DoesIntersect() function was listed as the top time-eater in my "optimized" code's analysis, but it was no-where to be found in the original code's analysis. A quick trip through the disassembler of the generated code confirmed my suspicion - my "optimization" has caused the compiler to stop inlining the DoesIntersect() function. In turn, this means an extra function call instead of just 9 comparisons, so that more or less explains all the extra time spent in the function.

So you can see how difficult it can be to actually optimize something properly.

Yes, the story is open-ended for now, I'll think about this overnight and perhaps come up with a better optimization. I'll be d***ed if I can't beat those 11 seconds down to 6! Big Grin