Energy Efficiency in GPU Applications, Part 2

In the second part of Energy Efficiency in GPU Applications, Part 1 I will show some real SoC power consumption numbers and how they correlate with the workload coming from an application.

Study: How Application Workload Affects Power Consumption

We made a brief study to find out how an application workload affects SoC power consumption. The idea of the study was to develop a small micro-benchmark that runs at just above 60fps on the target devices i.e. it is always vsync limited. Here is a screen shot from the micro-benchmark (it is called Torus Test):

To leave some room for optimization we added a few deliberate performance issues to the original version of the micro-benchmark:

The vertex count is too high
The texture size is too high
The fragment shader consumes too many cycles
Back-face culling is not enabled

We wanted to see how power consumption is affected when we reduce the workload by fixing each of the above performance issues individually. All of these performance issues and related optimizations are somewhat artificial for being used directly with real applications. The micro-benchmark was written on purpose in a way that none of these crazy optimizations have any major visual impact, but with a real-world application you probably wouldn't be able to decrease the texture resolution from 1920x1920 to 96x96 without a drastic impact on the visual quality of the application. However, the effect of the optimizations described here is the same as the effect of optimizing real applications: you improve the energy efficiency of your application by reducing GPU cycles and bandwidth consumption.

At ARM we have a few development SoCs that can be used for measuring actual SoC power consumption which we were able to use in the study. The micro-benchmark allows the measurement of system FPS in offscreen rendering mode without the vsync limit, as described previously. In the result graphs we use the frame time instead of the system FPS (frame time = 1s / system FPS), because that corresponds to the number of GPU cycles that consume power on the GPU. We also used the L2 cache external bandwidth counters for measuring the bandwidth consumed by the GPU. By using these metrics we wanted to see how the workload in the application and GPU correlates with the power consumption in the SoC. Here are the results.

Decreasing Vertex Count

The micro-benchmark allows us to configure how many vertices are drawn in each frame. We tested three different values (4160, 2940 and 1760). The following graph shows how the vertex count correlates with the frame time and SoC Power:

This micro-benchmark is not very vertex heavy but still the correlation between vertex count and SoC power consumption is clear. When decreasing the vertex count, power is not only saved by reduced vertex shader processing, but also because there is less external bandwidth needed to copy vertex data to/from the vertex shading core. Therefore we can also see the correlation between vertex count and external bandwidth in the above graph.

Decreasing Texture Size

The micro-benchmark uses a generated texture for texture mapping, which makes it possible to configure the texture size. We tested the performance with three different texture sizes (1920x1920, 960x960 and 96x96). Each object is textured with a separate texture object instance. As expected, the texture size doesn't affect the frame time much but it affects the external bandwidth. We found the following correlation between texture size, external bandwidth and SoC power:

Notice that the bandwidth doesn't decrease linearly with the number of texels in a texture. This is because with a smaller texture size there is a much better hit rate in the L2 cache, which quickly reduces the external bandwidth.

Decreasing Fragment Shader Cycles

The micro-benchmark implements a Phong shading model with a configurable number of light sources. We tested the performance with three different values for the number of light sources (5, 3, and 1). The Mali Shader Compiler outputs the following cycle count values for these configurations:

Light Sources	Arithmetic Cycles	Load/Store Cycles	Texture Pipe Cycles	Total Cycles
5	39	2	1	42
3	27	2	1	30
1	14	3	1	18

We found the following correlation between the number of fragment shader cycles, frame time and SoC power:

Adding Back-Face Culling and Putting All Optimizations Together

Finally, we tested the SoC power consumption impact when enabling back-face culling and when including all the previous optimization at the same time:

With all these optimizations we managed to reduce the SoC power consumption to less than 40% compared to the original version of the micro-benchmark. At the same time the frame time reduced to less than 30% and the bandwidth to less than 10% of the original micro-benchmark. Note that the large relative bandwidth reduction is possible due to the fact that writing the onscreen frame buffer to the external memory consumes very little bandwidth in this micro-benchmark, as Transaction Elimination was enabled in the device which is very effective with this application because there are lots of tiles filled with the constant background color that don't change between frames.

Conclusion

I hope this blog and the case study example has helped you to better understand the factors which impact on energy efficiency and the extent to which SoC power consumption can be reduced through optimizing GPU cycles and bandwidth in an application. As the processing capacity of embedded GPUs keeps growing, an application developer can often change the focus from performance optimization to energy efficiency optimization, which means that the desired visual output is implemented without consuming cycles or bandwidth unnecessarily. You should also consider the trade-off between improved visual quality and increased power consumption; is that last piece of "eye candy" which increases processing requirements by 20% really worth a 20-36% drop in the battery life for the end users of the application?

If you have any further questions, please don’t hesitate to ask them in the comments below.

Energy Efficiency in GPU Applications, Part 2

Study: How Application Workload Affects Power Consumption

Decreasing Vertex Count

Decreasing Texture Size

Decreasing Fragment Shader Cycles

Adding Back-Face Culling and Putting All Optimizations Together

Conclusion

Trending Articles

Practice Sheet of Right form of verbs for HSC Students

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

How to win at Markstrat (Markstrat Tips and Tricks) – Vodites

Ominde Commission Report and Recommendations – Ominde Report of 1964

Bureau of Internal Revenue: Regional Offices (Directory)

GO 53 on Enhancement of Ex-gratia upto 5 Lakhs Toddy Tappers in Telangana

Cakewalk CA-2A Leveling Amplifier v2.0.1.97 WiN, v2.0.1.96 OSX Incl Keygen

Mp3 Download: Mdu - Kunjenjenjena

How the kill the job , when DTP request running for long hours.

Microsoft Intune から展開しているアプリのアップデートについて

18-year-old girl was beaten for half an hour by two Northampton men in 'an...

Car crash in Dunton Bassett leaves driver in critical condition

Macky 2, Two Others In Road Accident

Application log 00000000000000089514: Could not convert queue DLVST90CLNT

Detroit mafia: D’Anna Brothers agree to plea deal

Delivery block field greyed out using VA02

Muloraki Au

【個人撮影】スマホのプライベート映像♪「中に出さないで///」カラオケ屋での生ハメ撮りが流出ｗ【リベンジポルノ】＠PornHub

BREAKING NEWS: Diamond Platnumz Is Reported Dead After Ghastly Car Accident

FIAT 500 B0111 B0112