Energy Efficiency in GPU Applications, Part 2


In this second part of Energy Efficiency in GPU Applications (following on from Part 1), I will show some real SoC power consumption numbers and how they correlate with the workload generated by an application.


Study: How Application Workload Affects Power Consumption

 

We made a brief study to find out how an application's workload affects SoC power consumption. The idea of the study was to develop a small micro-benchmark that runs at just above 60fps on the target devices, i.e. it is always vsync-limited. Here is a screenshot from the micro-benchmark (it is called Torus Test):

 

torus.png

To leave some room for optimization we added a few deliberate performance issues to the original version of the micro-benchmark:

 

  • The vertex count is too high
  • The texture size is too high
  • The fragment shader consumes too many cycles
  • Back-face culling is not enabled

 

We wanted to see how power consumption is affected when we reduce the workload by fixing each of the above performance issues individually. All of these performance issues and the related optimizations are somewhat artificial when applied directly to real applications. The micro-benchmark was deliberately written so that none of these drastic optimizations has any major visual impact; with a real-world application you probably couldn't decrease the texture resolution from 1920x1920 to 96x96 without a severe impact on visual quality. However, the effect of the optimizations described here is the same as the effect of optimizing real applications: you improve the energy efficiency of your application by reducing GPU cycles and bandwidth consumption.

 

At ARM we have a few development SoCs that allow actual SoC power consumption to be measured, and we used these in the study. The micro-benchmark allows the measurement of system FPS in offscreen rendering mode without the vsync limit, as described previously. In the result graphs we use frame time instead of system FPS (frame time = 1s / system FPS), because frame time corresponds to the number of GPU cycles that consume power on the GPU. We also used the L2 cache external bandwidth counters to measure the bandwidth consumed by the GPU. With these metrics we wanted to see how the workload in the application and the GPU correlates with the power consumption in the SoC. Here are the results.
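The frame-time metric used in the graphs is a simple conversion from the measured system FPS. A minimal sketch (the FPS values below are illustrative examples, not measurements from the study):

```python
def frame_time_ms(system_fps: float) -> float:
    """Convert a system FPS measurement into frame time in milliseconds.

    Frame time, unlike FPS, is proportional to the number of GPU cycles
    spent per frame, which is why the result graphs plot it instead.
    """
    return 1000.0 / system_fps

# Example: an offscreen run at 120 system FPS spends ~8.33 ms of GPU time per frame.
print(frame_time_ms(120.0))
```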

 

Decreasing Vertex Count


The micro-benchmark allows us to configure how many vertices are drawn in each frame. We tested three different values (4160, 2940 and 1760). The following graph shows how the vertex count correlates with the frame time and SoC Power:

 

vertex_count.png

This micro-benchmark is not very vertex-heavy, but the correlation between vertex count and SoC power consumption is still clear. When the vertex count decreases, power is saved not only by reduced vertex shader processing, but also because less external bandwidth is needed to copy vertex data to/from the vertex shading core. This is why the above graph also shows a correlation between vertex count and external bandwidth.
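To get a feel for the attribute traffic involved, we can estimate the raw per-frame vertex data for the three tested counts. The 24-byte vertex format (float3 position + float3 normal) is an assumption made for illustration; the benchmark's real vertex layout may differ:

```python
# Rough per-frame vertex attribute traffic for the three tested vertex
# counts. BYTES_PER_VERTEX assumes a float3 position plus a float3 normal
# (6 floats * 4 bytes); this is an illustrative assumption only.
BYTES_PER_VERTEX = 24

for vertex_count in (4160, 2940, 1760):
    kib = vertex_count * BYTES_PER_VERTEX / 1024
    print(f"{vertex_count} vertices -> ~{kib:.1f} KiB of attribute data per frame")
```

The absolute numbers are small, which matches the observation that the benchmark is not vertex-heavy; the point is that the traffic scales linearly with vertex count, so halving the vertices roughly halves this component of the external bandwidth.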

 

Decreasing Texture Size


The micro-benchmark uses a generated texture for texture mapping, which makes it possible to configure the texture size. We tested the performance with three different texture sizes (1920x1920, 960x960 and 96x96). Each object is textured with a separate texture object instance. As expected, the texture size doesn't affect the frame time much but it affects the external bandwidth. We found the following correlation between texture size, external bandwidth and SoC power:

 

texture_size2.png

Notice that the bandwidth doesn't decrease linearly with the number of texels in a texture. This is because with a smaller texture size there is a much better hit rate in the L2 cache, which quickly reduces the external bandwidth.
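The raw texel footprint of the three tested sizes makes the cache effect easy to see. Assuming uncompressed RGBA8 textures (4 bytes per texel; an assumption for illustration, since the benchmark's actual texture format is not stated):

```python
# Uncompressed RGBA8 footprint of the three tested texture sizes.
# This is raw texel data only; the measured external bandwidth falls even
# faster than texel count for the smallest texture, because 96x96 fits
# comfortably in the L2 cache and is rarely re-fetched from DRAM.
for side in (1920, 960, 96):
    bytes_total = side * side * 4  # 4 bytes per RGBA8 texel
    print(f"{side}x{side}: {bytes_total / (1024 * 1024):.2f} MiB")
```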

 

Decreasing Fragment Shader Cycles


The micro-benchmark implements a Phong shading model with a configurable number of light sources.  We tested the performance with three different values for the number of light sources (5, 3, and 1). The Mali Shader Compiler outputs the following cycle count values for these configurations:

 

| Light Sources | Arithmetic Cycles | Load/Store Cycles | Texture Pipe Cycles | Total Cycles |
|---------------|-------------------|-------------------|---------------------|--------------|
| 5             | 39                | 2                 | 1                   | 42           |
| 3             | 27                | 2                 | 1                   | 30           |
| 1             | 14                | 3                 | 1                   | 18           |
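Since these are per-fragment costs, the total shading work scales with cycle count times the number of shaded fragments. A minimal sketch, assuming one shading pass per pixel at 1080p with no overdraw (an illustrative assumption, not the benchmark's actual resolution):

```python
# Per-fragment cycle counts reported by the Mali Shader Compiler for the
# three light configurations: (arithmetic, load/store, texture pipe).
cycles = {5: (39, 2, 1), 3: (27, 2, 1), 1: (14, 3, 1)}

PIXELS = 1920 * 1080  # assumed render target, one fragment per pixel

for lights, (arith, ls, tex) in cycles.items():
    total = arith + ls + tex
    per_frame = total * PIXELS
    print(f"{lights} light(s): {total} cycles/fragment, "
          f"~{per_frame / 1e6:.0f}M shader cycles per frame")
```

Dropping from 5 lights to 1 cuts the per-fragment cost by more than half, and because fragment shading dominates this workload, the frame time and power follow the same trend in the graph below.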

 

We found the following correlation between the number of fragment shader cycles, frame time and SoC power:

 

fs_cycles.png

 

Adding Back-Face Culling and Putting All Optimizations Together


Finally, we tested the impact on SoC power consumption of enabling back-face culling, and then of applying all the previous optimizations at the same time:

 

all_optimizations.png

With all these optimizations we managed to reduce the SoC power consumption to less than 40% of the original version of the micro-benchmark. At the same time the frame time was reduced to less than 30% and the bandwidth to less than 10% of the original. Note that the large relative bandwidth reduction is possible because writing the onscreen frame buffer to external memory consumes very little bandwidth in this micro-benchmark: Transaction Elimination was enabled on the device, and it is very effective with this application because there are lots of tiles filled with the constant background color that don't change between frames.

 

Conclusion


I hope this blog and the case study example have helped you to better understand the factors which impact energy efficiency, and the extent to which SoC power consumption can be reduced by optimizing GPU cycles and bandwidth in an application. As the processing capacity of embedded GPUs keeps growing, an application developer can often shift the focus from performance optimization to energy efficiency optimization, i.e. implementing the desired visual output without consuming cycles or bandwidth unnecessarily. You should also consider the trade-off between improved visual quality and increased power consumption: is that last piece of "eye candy" which increases processing requirements by 20% really worth a 20-36% drop in battery life for the end users of your application?


If you have any further questions, please don’t hesitate to ask them in the comments below.

