That's option three. The problem is that the preprocessor (in Lua) needs to understand the source well enough to perform stream fusion; otherwise the code ends up with a large number of for loops (one per operation). That is not something that can be done with regexes, and it would be easier to write a parser for a small functional language than a C++ parser.
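To make the fusion point concrete, here is a rough sketch in Python (not the Lua preprocessor itself). The `Map` node, the `{x}` placeholder, and the `in`/`out`/`tmp` buffer names are all made up for illustration. It assumes the pipeline has already been parsed into a list of elementwise stages, which is exactly the part regexes can't do.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Map:
    # Hypothetical IR node: one elementwise stage, written as a C expression
    # template in which "{x}" stands for the incoming element.
    expr: str


def emit_unfused(stages: List[Map]) -> str:
    # Naive code generation: one loop per stage, with a temporary in between.
    loops = []
    src = "in"
    for k, stage in enumerate(stages):
        dst = "out" if k == len(stages) - 1 else f"tmp{k}"
        body = stage.expr.format(x=f"{src}[i]")
        loops.append(f"for (int i = 0; i < n; ++i) {dst}[i] = {body};")
        src = dst
    return "\n".join(loops)


def emit_fused(stages: List[Map]) -> str:
    # Fusion: compose the stage expressions, then emit a single loop.
    acc = "in[i]"
    for stage in stages:
        acc = "(" + stage.expr.format(x=acc) + ")"
    return f"for (int i = 0; i < n; ++i) out[i] = {acc};"


pipeline = [Map("{x} * 2.0f"), Map("{x} + 1.0f")]
print(emit_unfused(pipeline))  # two loops and a temporary buffer
print(emit_fused(pipeline))    # one loop, no temporary
```

Composing expressions like this only works because each stage is known to be a pure elementwise operation, which is why a small functional source language is much easier to fuse than arbitrary C++.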
EDIT:
As for OpenCL, the portableCL (pocl) project is MIT-licensed and supports anything Clang does.