Channel: ARM Mali Graphics

Energy Efficiency in GPU Applications, Part 1


In this blog I will talk about energy efficiency in embedded GPUs and what an application programmer can do to improve the energy efficiency of their application. I have split this blog into two parts: in the first part I will give an introduction to the topic of energy efficiency, and in the second part I will use an in-house micro-benchmark to present real SoC power measurements and demonstrate the extent to which a variety of factors impact frame rendering time, external bandwidth and SoC power consumption.

 

Energy Efficiency in the GPU/Device

 

Let's look first at what energy efficiency means from the GPU's perspective.  At a high level the energy is consumed by the GPU and its associated driver in three different ways:

 

  • The GPU runs active cycles in the hardware to perform its computation tasks in one or more of its cores.
  • The GPU/driver issues memory transactions to read data from, or write data to, external memory.
  • GPU driver code executes on the CPU, in either user mode or kernel mode.

 

On most devices Vertical Synchronization (vsync) synchronizes the frame rate of an application with the screen display rate. Using vsync not only removes tearing, but also reduces power consumption by preventing the application from producing frames faster than the screen can display them. When vsync is enabled on the device the application cannot draw frames faster than the vsync rate, which is typically 60fps on modern devices, so we can keep that as our working assumption in this discussion. On the other hand, in order to give the best possible user experience the application/GPU should not draw frames significantly slower than the vsync rate. Therefore the device/GPU tries hard to keep the frame rate constantly at 60fps, while also trying to use as little power as possible.

 

A device typically has power management functionality for both the GPU and the CPU in order to adjust their operating frequencies based on the current workload. This functionality is referred to as DVFS (Dynamic Voltage and Frequency Scaling). DVFS allows the device to handle both normal and peak workloads in an energy efficient fashion by adjusting the clock frequency to provide just enough performance for the current workload, which in turn allows us to drop the voltage, as we do not need to drive the transistors as hard to meet the more relaxed timing constraints. The energy consumed per clock is proportional to V², so if we drop the frequency to allow a voltage reduction of 20%, energy efficiency improves by 36%. Using a higher clock frequency than needed means a higher voltage and consequently higher power consumption, so the power management tries to keep the clock frequency as low as possible while still keeping the frame rate at the vsync rate. When the GPU is under extremely high load, some vendors allow the GPU to run at an overdrive frequency - a frequency which requires a voltage higher than the nominal voltage for the silicon process - which can provide a short performance boost, but cannot be sustained for long periods. If high workload from an application keeps the GPU frequency overdriven for a long time, the SoC may overheat, and as a consequence the GPU is forced to use a lower clock frequency to let the SoC cool down, even if the frame rate drops under 60fps. This behavior is referred to as thermal throttling.

 

Device vendors often differentiate their devices by making their own customizations to the power management. As a result two devices having the same GPU may have different power management functionality. The ARM® Mali™ GPU driver provides an API to SoC vendors that can be used for implementing power management logic based on the ongoing workload in the GPU.

 

In addition to DVFS, some systems may also adjust the number of active GPU cores to find the most energy efficient configuration for the given GPU workload. Typically, DVFS provides just a few available operating frequencies and enabling/disabling cores can be used for fine-tuning the processing capacity for the given workload to save power.

 

In its simplest form, power management is implemented locally for the GPU, i.e. GPU power management is based only on the ongoing GPU workload and the temperature of the chip. This is not optimal, as there can be several other sub-systems on the chip which all "compete" with each other to get the maximum performance for their own processing, until the thermal limit is exceeded and all sub-systems are forced to operate at a lower capacity. A more intelligent power management scheme maintains a power budget for the entire SoC and allocates power to the different sub-systems in a way that avoids thermal throttling.

 

Energy Efficiency in Applications


From an application point of view, the power management functionality provided by the GPU/device means that the GPU/device always tries to match its processing capacity to the workload coming from the application. This adjustment happens automatically in the background, and as long as the application workload doesn't exceed the maximum capacity of the GPU, the frame rate remains constantly at the vsync rate regardless of the application workload. The only side effects of a high application workload are that the battery drains faster and you can feel the released energy as a warmer device.

 

Most applications don't need to create a higher workload than the GPU's maximum processing capacity, i.e. the power management is able to keep the frame rate constantly at the vsync level. The interval between two vsync points is 1/60 of a second (about 16.7 ms), and if the GPU completes a frame faster than that, it sits idle until the next frame starts. If the GPU constantly has lots of idle time before the next vsync point, the power management may decrease the GPU clock frequency to a lower level to save power.

 

As the maximum processing capacity of modern GPUs keeps growing, an application developer often no longer needs to optimize the application for better performance, but instead for better energy efficiency, and that is the topic of this blog.

 

How to Make an Application Energy Efficient


In order to be energy efficient the application should:

 

  • Render frames with the least number of GPU cycles
  • Consume the least amount of external memory bandwidth
  • Generate the least amount of CPU load either directly in the application code or indirectly by using the OpenGL® ES API in a way that causes unnecessary CPU load in the driver

 

But hey, aren't these the same things that you used to focus on when optimizing your application for better performance? Yes, pretty much! To explain this further:

 

  • Every GPU cycle that you save when rendering a frame means more idle time in the GPU before the next vsync point. In the best case the idle time becomes long enough to allow the power management to use a lower GPU frequency or enable a smaller number of cores
  • Reducing bandwidth load doesn't always improve performance as GPUs are designed to tolerate high memory latencies without affecting performance. However, reducing bandwidth can improve energy efficiency significantly
  • The same as for bandwidth, extra CPU load may not impact performance but it definitely can increase the power consumption

 

So the task of improving energy efficiency becomes very similar to the task of optimizing the performance of an application. For that task you can find lots of useful tips in the Mali GPU Application Optimization Guide.


How Do You Measure Energy Efficiency?

 

There is one topic that may require some more attention: how can you measure the energy efficiency of your application? Measuring the actual SoC power consumption might not be practical. It might also be problematic to measure the system FPS of your application if vsync is enabled on your device and you cannot turn it off.

 

ARM provides a tool called DS-5 Streamline for system-wide performance analysis. Using DS-5 Streamline to detect performance bottlenecks is explained in Peter Harris's blog Mali Performance 1: Checking the Pipeline, in Lorenzo Dal Col's blogs starting with Mali GPU Tools: A Case Study, Part 1 — Profiling Epic Citadel, and also in the Mali GPU Application Optimization Guide. In short, DS-5 Streamline allows you to measure the main components of energy efficiency with the following charts / HW counters:

 

GPU cycles:

  • Mali Job Manager Cycles: GPU cycles
    • This counter increments on every clock cycle in which the GPU is doing any work
  • Mali Job Manager Cycles: JS0 cycles
    • This counter increments on every clock cycle in which the GPU is fragment shading
  • Mali Job Manager Cycles: JS1 cycles
    • This counter increments on every clock cycle in which the GPU is vertex shading or tiling

 

External memory bandwidth:

  • Mali L2 Cache: External read beats
    • Number of external bus read beats
  • Mali L2 Cache: External write beats
    • Number of external bus write beats

 

CPU load:

  • CPU Activity
    • The percentage of the CPU time spent in system or user code

 

Another very useful tool for measuring GPU cycles is the Mali Offline Shader Compiler, which allows you to see how many GPU cycles are spent in the arithmetic, load/store and texture pipes in the shader core. Each cycle saved in the shader code can translate into thousands or millions of cycles saved per frame, as the shader is executed for every vertex/fragment.

 

If you want to measure the performance of an application on a vsync limited device, you can do so by rendering the graphics in offscreen mode using FBOs. This is the trick used by some benchmark applications to get rid of vsync and resolution limitations in performance measurement. The key point is that the vsync limitation applies only to the onscreen frame buffer, not to offscreen framebuffers implemented with FBOs. You can therefore measure performance by rendering to an FBO that has the same resolution and configuration (color and depth buffer bit depths) as the onscreen frame buffer. After setting up the FBO and binding it with glBindFramebuffer(), your rendering functions see no difference between rendering to the onscreen frame buffer and to an FBO. However, in order to make the performance measurement work correctly you need to do a few things:

 

  • You need to consume your FBO rendering results in the onscreen frame buffer. This is necessary because if you render something to an FBO and don't use your rendering results for anything visible, there is no guarantee that the GPU actually renders anything. After rendering to an FBO you can down-sample your output texture into a small area in the onscreen frame buffer. This guarantees that the GPU must render the frame image into an FBO as expected.
  • The offscreen rendering should be implemented with two different FBOs in order to simulate double buffering functionality. After rendering a frame to an FBO, you should down-sample the output texture to the onscreen buffer, and then swap to another FBO that is used for rendering the next frame.
  • You should use glDiscardFramebufferEXT (OpenGL ES 2.0) or glInvalidateFramebuffer (OpenGL ES 3.0) to discard the depth/stencil buffers right after the rendering of a frame to an FBO is complete. This is necessary to avoid writing the depth/stencil buffer out to main memory on Mali GPUs (the same effect happens for the onscreen frame buffer when you call eglSwapBuffers()). You can find more details on this topic in Mali Performance 2: How to Correctly Handle Framebuffers.
  • After rendering a suitable number of offscreen frames (for example 100) and down-sampling them to a small area in the onscreen frame, you can call eglSwapBuffers() as normal to present the frame in the onscreen buffer. You can measure the offscreen FPS by dividing the total number of rendered offscreen frames by the total rendering time measured when eglSwapBuffers() returns.

 

There is a small overhead in the performance measurement when using this method, because of the down-sampling of the offscreen frames into the onscreen frame, but it should nevertheless give you quite representative FPS results without the vsync limitation.

 

Is it really worth it?

 

You might ask how significant an energy saving you can really get by optimizing your application. We will focus on that in the next part of this blog where I will present a small micro-benchmark that will show how much you can reduce real SoC power consumption by optimizing your application.

