ARM’s Multimedia IP portfolio is designed to work together to reduce overall system power while delivering the performance that is central to the mobile device experience.
What is the Multimedia Experience?
Most of the interactions that users have with modern tablets and smartphones count as a multimedia experience, integrating sound, vision and interaction for every task. From our point of view, the most complex part of this experience is vision - pushing pixels.
Device resolutions are growing fast. Already, smartphones commonly run at 1080p and tablets sport 2560x1600 display panels. There is no sign of this trend slowing down any time soon. At the same time, the display refresh is expected to be smooth, with 60 frames per second now seen as a minimum target rather than a luxury.
The amount of computing power required to calculate each pixel is also going up. Operating systems usually allocate a separate frame buffer for each application on the screen, and then compose these outputs onto the final display. The applications themselves use more sophisticated shaders to represent lighting and surface detail, and this applies to UIs as well as games.
All of this requires more computation, more memory bandwidth and, unless we are careful, a lot more power.
Multimedia System Components
The multimedia system consists of a number of components which each serve a different function. Each task is handled by a hardware block which has inputs, intermediate data and outputs, all of which contribute to the total power budget for that task.
Here are some use cases which will hopefully illustrate what I mean and introduce each of ARM’s IP blocks:
Use case | Hardware | Input data | Intermediate data | Output data |
Video encoding | Camera, ARM® Mali™-V500 VPU | Uncompressed frames | Reference frames | Video stream |
Video decoding | Mali-V500 VPU | Video stream | Reference frames | Uncompressed frames |
UI | | Geometry, Textures | | Rendered images |
Gaming | Mali-T628 GPU | Geometry, Textures | G-buffers (render-to-texture) | Rendered images |
Standalone Optimization
The most obvious place to start is to optimize each IP block on its own. The ARM Mali GPU team is architecting the Midgard series of GPUs and video processors to be best-in-class in terms of efficiency.
Working closely with the semiconductor foundries, we have also developed the ARM POP™ IP for Mali GPUs, which increases the performance per watt of our GPUs using a number of targeted low-level optimizations. We have customized the cell library and layout rules to best match the characteristics of the manufacturing process. We also created compound cells such as multi-bit flip-flops and custom memory layouts that best serve the unique requirements of GPU data.
We also include additional tools for defining and implementing power gating rules, so that existing clock-gating strategies can be extended to reduce static as well as dynamic power.
Bandwidth, Bandwidth, Bandwidth
One of the most power-hungry parts of the system is the memory. As the number and complexity of pixels increase, so does the memory bandwidth requirement.
Memory technology is getting better and lowering power for each access, but the increase in bandwidth requirement overwhelms that trend. Here are some approximate values for just the DRAM chip itself:
These values are for 2-channel LPDDR, averaged from various online sources. Add in the memory controller and interconnect power, and the problem only looks worse.
Another problem apart from the demand for raw bandwidth is that the “Random” in “Dynamic Random Access Memory” isn’t that random any more. With the RAM core speed not increasing, the interfaces are serializing wider and wider internal access widths onto the bus. If you want 1 byte, you still get its 63 neighbours along for the ride. ARM IP is designed to ensure good locality of access so that those additional bytes are likely to contain data which will be required soon. The job of the cache on a mobile GPU is as much to control memory bandwidth as to increase performance.
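To make the locality point concrete, here is a minimal sketch that counts how many of the fetched bytes are actually wanted for two access patterns. It assumes a 64-byte burst (the "1 byte plus its 63 neighbours" case); the function names are purely illustrative and do not describe any real memory controller.

```python
# Assumption: every DRAM access delivers a 64-byte burst,
# i.e. the requested byte plus its 63 neighbours.
BURST_BYTES = 64

def bursts_touched(addresses, burst_bytes=BURST_BYTES):
    """Return the set of burst indices that the byte addresses fall into."""
    return {addr // burst_bytes for addr in addresses}

def useful_fraction(addresses, burst_bytes=BURST_BYTES):
    """Fraction of the fetched bytes that were actually requested."""
    unique = set(addresses)
    fetched = len(bursts_touched(unique, burst_bytes)) * burst_bytes
    return len(unique) / fetched

# 64 consecutive bytes: one burst, every byte in it is used.
sequential = list(range(0, 64))
# 64 bytes spaced 64 apart: each one drags in a whole burst of its own.
strided = list(range(0, 64 * 64, 64))

print(useful_fraction(sequential))  # 1.0
print(useful_fraction(strided))     # 0.015625
```

Both patterns ask for the same 64 bytes of data, but the strided one moves 64 times as much across the bus to get them.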
It’s good to get the best bandwidth for your data, of course, but the data that take the least bandwidth are the data that you never read (or write).
Texture Compression
In some games, 90% of the memory read bandwidth can be texture accesses. Anyone who has read this blog for a while will know that I am about to sing the praises of ASTC again. And I will, but only in a very quick summary. If you want more details, see my previous blog.
ARM’s ASTC, or Adaptive Scalable Texture Compression, is a new texture compression format which offers better quality at a lower bit rate than any of the low dynamic range compression schemes available today, and matches the performance of the de facto high dynamic range codec. By allowing content developers to tune their texture compression more finely, ASTC will further reduce the bandwidth required for textures.
And now, for the first time, ASTC-capable hardware is in the hands of consumers thanks to the ARM Mali-T628 based Samsung Galaxy Note 3. For the consumer, the inclusion of ASTC means that applications which make use of it will have visibly better texture quality and smaller texture size, often at the same time. Smaller texture sizes result in faster downloads and, most importantly of all, lower power consumption as the GPU requires fewer memory accesses to display them.
Transaction Elimination
All the ARM Mali GPUs are tile-based renderers, coloring the pixels for a single tile of the screen in a small internal memory before writing it out to main memory. However, if the pixels in the tile have not changed since the last time it was written, there is no need to write it again. We can eliminate the memory transaction.
This reduces the bandwidth required to write the frame buffer, and with resolutions going up all the time, the frame buffer bandwidth is considerable. For the larger tablets, a 2560x1600 pixel display at 24 bits/pixel and 60 frames per second update requires a whopping 737MB/s just to write.
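The arithmetic behind that number is simple enough to check. The sketch below multiplies out pixels, bytes per pixel and frame rate; the 1080p line assumes the same 24 bits/pixel and 60 frames per second for a smartphone panel, which is my assumption for comparison only.

```python
def frame_buffer_write_bandwidth(width, height, bits_per_pixel, fps):
    """Bytes per second needed to write every pixel of every frame once."""
    bytes_per_frame = width * height * bits_per_pixel // 8
    return bytes_per_frame * fps

tablet = frame_buffer_write_bandwidth(2560, 1600, 24, 60)
# Assumption: a 1080p phone panel driven at the same depth and refresh rate.
phone = frame_buffer_write_bandwidth(1920, 1080, 24, 60)

print(f"2560x1600: {tablet / 1e6:.1f} MB/s")  # 2560x1600: 737.3 MB/s
print(f"1920x1080: {phone / 1e6:.1f} MB/s")   # 1920x1080: 373.2 MB/s
```

The tablet result is in line with the figure quoted above, and it covers writes only; reading the buffer back for display costs the same again.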
Transaction elimination helps to reduce that by between 30 and 80 percent, especially in crucial long-running use cases like UI, web browsing, and Angry Birds.
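As a rough illustration of the idea, not of the hardware, the sketch below keeps a signature for each tile and skips the write when the signature matches the one from the previous frame. The use of CRC32, the 16x16 tile size and the class and method names are all assumptions made for the example. A real design would also have to consider signature collisions, which this sketch ignores.

```python
import zlib

TILE = 16  # assumed tile edge in pixels, for illustration only

def tile_signature(tile_bytes):
    """Signature of one tile's pixels; CRC32 is an illustrative choice."""
    return zlib.crc32(tile_bytes)

class FrameBufferWriter:
    """Toy model of a tile writer that skips unchanged tiles."""

    def __init__(self):
        self.signatures = {}  # tile index -> signature of last written contents
        self.memory = {}      # stand-in for the frame buffer in main memory
        self.writes = 0
        self.skipped = 0

    def write_tile(self, index, tile_bytes):
        sig = tile_signature(tile_bytes)
        if self.signatures.get(index) == sig:
            self.skipped += 1  # same signature as last time: no transaction
            return False
        self.memory[index] = tile_bytes
        self.signatures[index] = sig
        self.writes += 1
        return True

writer = FrameBufferWriter()
background = bytes(TILE * TILE * 3)        # all-black 24-bit tile
sprite = bytes([255]) * (TILE * TILE * 3)  # all-white 24-bit tile

# Frame 1: four tiles, all new, so all four are written.
for i in range(4):
    writer.write_tile(i, background)
# Frame 2: only tile 2 has changed.
for i in range(4):
    writer.write_tile(i, sprite if i == 2 else background)

print(writer.writes, writer.skipped)  # 5 3
```

In this toy run the second frame costs one tile write instead of four, which is the kind of saving a mostly static UI screen produces.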
Frame Buffer Compression
How can we reduce frame buffer bandwidth further? An obvious idea is to compress it, but it’s not so obvious how. We need to preserve quality, so it should be lossless. It needs to be fast and cheap to compress and decompress. For video and GPU use cases, it also needs fast random access.
With the ARM Mali-V500 video processor and future High-End Midgard GPUs, we have included support for ARM Frame Buffer Compression, or AFBC for short.
This is the secret of the ARM Mali-V500’s astonishingly low bandwidth figures. By compressing the intermediate reference frames used by the video codec, the bandwidth drops dramatically. Typical Blu-ray content can be compressed by 40%, and this saving is multiplied with every read and write. For details, see my colleague Ola’s blog Mali-V500 video processor: reducing memory bandwidth with AFBC.
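To see how a per-frame saving adds up across accesses, here is a back-of-the-envelope sketch. Everything except the 40% figure is an assumption chosen for illustration: a 1080p frame stored as 4:2:0 at 12 bits per pixel, and each reference frame being written once and read twice in full. Real codecs access reference data block by block, so treat this as a simplification.

```python
def reference_traffic(frame_bytes, reads, writes, saving_percent=0):
    """Bytes moved for one reference frame across all its reads and writes."""
    stored_bytes = frame_bytes * (100 - saving_percent) // 100
    return stored_bytes * (reads + writes)

# Assumption: 1080p, 4:2:0 chroma subsampling, 8 bits per sample (12 bits/pixel).
frame_bytes = 1920 * 1080 * 12 // 8

# Assumption: each reference frame is written once and read twice.
uncompressed = reference_traffic(frame_bytes, reads=2, writes=1)
compressed = reference_traffic(frame_bytes, reads=2, writes=1, saving_percent=40)

print(uncompressed)               # 9331200
print(compressed)                 # 5598720
print(uncompressed - compressed)  # 3732480
```

Under these assumptions the saving per reference frame is larger than the size of the uncompressed frame itself, because the 40% is collected on every access rather than once.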
Future High-End Midgard GPUs will support AFBC for input, so they can directly use compressed video input, and also for output. This supports the popular technique of G-buffering, where intermediate rendering results are rendered out by the GPU and reused as textures in a final pass. This can be used to reduce computation but at the expense of bandwidth. By using AFBC, the bandwidth is reduced and the applicability of the technique widens.
Tying the System Together
These bandwidth reduction techniques can be applied to single cores, but the full potential is only realizable using a fully joined-up approach.
With all the IP blocks in the system supporting these technologies, we can achieve significant end-to-end savings in bandwidth and power.
And that will help to ensure that your next smartphone or tablet not only looks cool, but is cool too.