Got HeightGenClassic to run on my machine: GPU runtime 30 microseconds, data transfer 2 microseconds. Total execution time from enqueuing the kernel to finishing the read-back: 303 microseconds. Something tells me that if we want to use this, we need to have several chunks in flight at once.
For comparison, still using the OpenCL API: CPU runtime 58 microseconds, total runtime 98 microseconds.
So there are significant performance increases to be had if we can deal with the latency by doing things asynchronously and batching.
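
To put a rough number on "several chunks in flight", here is a back-of-the-envelope sketch in Python. It assumes (this is not measured) that everything in the 303 microsecond GPU total beyond compute and transfer is a fixed latency paid once per batch, and that compute and transfer scale linearly with the number of chunks. The function names are made up for illustration.

```python
# Timings in microseconds, taken from the measurements above.
GPU_COMPUTE = 30
GPU_TRANSFER = 2
GPU_TOTAL = 303
CPU_TOTAL = 98

# Assumption: whatever is left after compute + transfer is fixed
# per-submission latency (enqueue, scheduling, read-back sync).
GPU_LATENCY = GPU_TOTAL - GPU_COMPUTE - GPU_TRANSFER


def gpu_time_per_chunk(chunks_in_flight: int) -> float:
    """Amortised GPU cost per chunk if the latency is paid once per batch.

    This is a simplified model, not a measurement: real overlap between
    transfers and kernels may make it better or worse.
    """
    if chunks_in_flight < 1:
        raise ValueError("need at least one chunk in flight")
    return GPU_LATENCY / chunks_in_flight + GPU_COMPUTE + GPU_TRANSFER


def break_even_chunks() -> int:
    """Smallest batch size where the modelled GPU cost beats the CPU total."""
    n = 1
    while gpu_time_per_chunk(n) >= CPU_TOTAL:
        n += 1
    return n


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(f"{n:2d} chunks in flight: {gpu_time_per_chunk(n):6.1f} us per chunk")
    print("break-even batch size:", break_even_chunks())
```

Under these assumptions the GPU only pulls ahead of the unbatched CPU path once about 5 chunks are in flight. The comparison is generous to the GPU, since the CPU path would presumably also benefit from batching.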