If you have some experience with GPU compute, or have watched the GPU Compute for Mobile Devices presentation from the ARM Techcon Developer Summit, and you have a Compute application that you want to optimize, it can be hard to know where to start. You have been given some advice, but it is not always obvious which optimizations are relevant for your particular kernels.
At the ARM Techcon Developer Summit, I talked about that problem and tried to give an intuition for how threads whirl around inside the cores while executing your kernels. As always, a prerequisite for successful optimization is some understanding of where the bottlenecks might be. The first part of this presentation aims to give that understanding for Mali. Once you understand how execution happens, the hardware counters in the GPU let you look inside the cores to see what is actually going on while your program is running. Streamline gives a nice timeline view of many kinds of counters, and the second part of this presentation introduces them and shows how to use them for optimizing Compute kernels.
If you have any questions, this website is the place to ask.
Have fun!