
“PHENOMENAL COSMIC POWERS! Itty-bitty living space!”


Time and time again when I’m presenting at events I am asked the same questions: What does the future of mobile content look like? How much performance will content developers need? When will mobile devices reach console quality? My answers to the first two always end up being vague and non-committal – surely the first depends on how much computing power is available and the second on the ambitions of the developers in question? Ultimately, those first two questions depend on the answer to the third. That is what we will address in this blog.

 


Theoretically, the compute potential of mobile devices had just managed to catch up with that of current generation consoles (before the Xbox One and PlayStation 4 were released recently, raising that hanging branch a little bit higher). Both superphones and consoles such as the Xbox 360 were offering about 250 GFLOPS of computing horsepower. In fact, by the end of 2013, superphone computing performance was expected to finally equal that of first generation shader-based desktop GPUs – and as the mobile compute trend is happily following Moore’s Law, it is not looking likely to slow down its improvements any time soon.

 

"Yeah, but can it run Crysis?"

There is no doubt about it, these devices are pulling some amazing performance potential, so why aren’t we seeing a corresponding increase in content complexity? Why can't I play Crysis on a handset? To understand that, you have to look at the comparable bandwidth available on mobile, console and desktop. 

 

State-of-the-art desktops have an inordinate level of bandwidth available to them. The Xbox 360 had 32GB/s. State-of-the-art mobile SoCs can currently offer a theoretical max of 8-12GB/s. The reason available bandwidth is so low in the mobile space is that it comes at the price of power – a scarce commodity on a battery powered device!

 


Whereas desktop GPUs routinely require >300W for the GPU alone and the current crop of console systems are typically between 80W and 100W, mobile devices have somewhere in the region of 3W to 7W (superphone and tablet respectively) available with which they have to power not just the GPU, but also the CPU, modem, WiFi and the display!  With this restriction on power, bandwidth capacity has not grown at the same rate as compute and the mobile space remains, at this moment in time, two to three years behind consoles and eight plus years behind desktops. So, whilst semiconductor companies are striving to deliver higher levels of bandwidth within a mobile SoC, developers should also learn the methods of getting “100W” of work from the 3W available.


The realities of mobile device bandwidth.

A typical system configuration for a superphone will have a memory system that will yield approximately 8GB/second as a theoretical bandwidth limit for the entire system. Right off the bat, as the memory system is not 100% efficient in real life (you have to allow for memory refresh, opening and closing of pages, etc.), we need to degrade that figure to approximately 80%. Already we are at 6.4GB/second as a starting point.


From a GPU standpoint we need to add in some basic system overheads. Firstly, as we are performing a graphics function, you need to get your pixels to a display at some point.  For a superphone with a 1080p screen updating at 60fps the display subsystem is absorbing about 0.5GB/second of that. Similarly, we need to leave the CPU with some bandwidth for the driver etc. This will be somewhere in the region of 0.5GB/second. This is slightly conservative, but allows for the overhead of low latency access required for the CPU. This leaves us with a total of approximately 5.4GB/second for the GPU.

Although we’ve used 80% utilization as an average, it is worth keeping in mind that not all SoCs are created equal and there is a dynamic range here. We have seen platforms which, for various reasons, are only able to achieve 70% of the theoretical max. It’s also worth mentioning that utilization often degrades with increased memory clocks due to increased latency.


How to blow your bandwidth real quick!

So with a starting point of 5.4GB/second, if we divide that by our target frame rate (60fps) we have a per frame starting point of 90MB/frame peak bandwidth.  Rendering an image requires the target frame buffer to be written back from the GPU at least once. That removes about 8MB/frame leaving us at about 82MB/frame.


Dividing that by the resolution gives you approximately 40 bytes/pixel to cover everything: attribute inputs, varying outputs, varying inputs and texture inputs. The texture and frame buffer bandwidth costs are per-pixel costs, while the impact of the others depends on polygon size, but it’s still worth thinking in terms of "per pixel" bandwidth cost.
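For a rough sanity check, the budget arithmetic above can be reproduced in a few lines of C. This is a minimal sketch: the 8GB/s theoretical bandwidth, 80% efficiency, 0.5GB/s CPU reservation and 1080p at 60fps display figures are the assumptions quoted in the text, not measurements.

```c
#include <stdio.h>

int main(void)
{
    const double theoretical_bw = 8.0e9;    /* 8 GB/s system bandwidth       */
    const double efficiency     = 0.80;     /* ~80% real-world utilization   */
    const double fps            = 60.0;
    const double width = 1920.0, height = 1080.0;
    const double bpp            = 4.0;      /* 32-bit framebuffer pixels     */

    double usable   = theoretical_bw * efficiency;       /* ~6.4 GB/s        */
    double display  = width * height * bpp * fps;        /* ~0.5 GB/s        */
    double cpu      = 0.5e9;                              /* CPU/driver       */
    double gpu_bw   = usable - display - cpu;             /* ~5.4 GB/s        */

    double per_frame = gpu_bw / fps;                      /* ~90 MB/frame     */
    double fb_write  = width * height * bpp;              /* ~8 MB write-back */
    double budget    = per_frame - fb_write;              /* ~82 MB/frame     */

    printf("GPU bandwidth    : %.2f GB/s\n", gpu_bw / 1e9);
    printf("Per-frame budget : %.1f MB\n", budget / 1e6);
    printf("Bytes per pixel  : %.1f\n", budget / (width * height));
    return 0;
}
```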


Three of the five main suppliers of GPU technology to the mobile market – together making up approximately 90% of the total – use tile-based/deferred rendering GPUs of one form or another, and this is no coincidence. Tile-based rendering allows most of the intermediate fragment write and Z read/write cost to be removed, as we do all of that “in tile” on the GPU, saving the vast majority of the bandwidth.


However, deferred rendering does require vertex data to be “Read, Written & Read back” as part of the binning process (the method by which tile based renderers decide which primitives contribute to which screen tile sections). This is where most of the bandwidth is used.


Let’s consider a simple, vendor-agnostic example of this. For the sake of argument, we’ll assume a tri-strip optimized mesh (i.e. we only need one unique vertex per new triangle) where each vertex has at least:


  • Position (3*FP32[XYZ] = 12 bytes)
  • Some form of colour information (3*FP32[RGB] = 12 bytes)
  • Plus one texture source (2x FP32[UV] = 8 bytes)
  • Total = 32 bytes.


Now, as said above, we need to read, write and read back that information, so we basically use 96 bytes of bandwidth. While caching etc. has an effect, everyone does it slightly differently, so for the sake of simplicity let’s say that’s compensated for by the tri-strip optimization for the moment. 


Now, we’ve specified that we have a texture. Assuming we do nothing else but light it and that all the per-fragment inputs are register mapped, we need to fetch at least one Texel per-fragment so an uncompressed texture would be 1x INT32[RGB] = 4 bytes. Assuming we apply a limit of 10 fragments per primitive as per the guide figure for primitive to fragment ratio we discuss in “Better living through (appropriate) geometry” we have a total of 40 bytes for the fragments. Working this through you can basically see that we are hitting the bandwidth limit in a very simple use case which would yield approximately 603K polys/frame or about 36M polys/sec. Now that sounds like a lot, but we haven't done anything “interesting” with the polygon yet. By increasing that by an additional texture source or adding a surface normal etc. that number comes down pretty quickly.

 

Let's have a look at what happens when we do some very basic things to our polygon. We can't really do anything about the position without it getting a bit complicated (although it was popular in the early days of 3D games to send objects with compressed object space co-ordinates and then scale using the transform), but instantly we can bring that colour value down to a packed RGB:888 value to reduce the overhead by a third. We can also halve the texture co-ordinate size to FP16 for U and V, this is no hardship if you use normalized co-ordinates as they can be scaled and calculated at a higher resolution inside the shader code. 


  • Position (3*FP32[XYZ] = 12 bytes)
  • Some form of colour information (1*INT32[RGBA] = 4 bytes)
  • Plus one texture source (2x FP16[UV] = 4 bytes)
  • Total = 20 bytes.


Now we've gone from 603K polys/frame to just over 1M polys/frame or 60M polys/sec. If we apply texture compression to the texture source we can get the 4 bytes per texture fetch to 4 bits using ETC1 (5 bytes for our 10 pixel poly) or down to 2 bits (or lower) using ASTC (2.5 bytes for our 10 pixel poly). This brings us up to 1.26M to 1.3M polys/frame or 75M to 78M polygons/sec, which I'm sure you'll agree is a hell of a lot more impressive. 
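To make the leaner layout concrete, here is a hedged sketch of how such a 20-byte packed vertex might be declared and fed to OpenGL ES. The struct name, attribute locations and buffer handle are illustrative assumptions; GL_HALF_FLOAT for the UVs requires OpenGL ES 3.0 (or the half-float vertex extension on ES 2.0).

```c
#include <GLES3/gl3.h>
#include <stddef.h>
#include <stdint.h>

/* 20-byte packed vertex: FP32 position, packed RGBA8 colour, FP16 UVs. */
typedef struct {
    GLfloat  position[3];   /* 12 bytes                                   */
    uint8_t  colour[4];     /*  4 bytes, RGBA8, normalized in the shader  */
    uint16_t uv[2];         /*  4 bytes, half-float texture coordinates   */
} PackedVertex;

static void setup_vertex_attributes(GLuint vbo)
{
    const GLsizei stride = sizeof(PackedVertex);   /* 20 bytes */

    glBindBuffer(GL_ARRAY_BUFFER, vbo);

    /* Attribute locations 0/1/2 are assumptions for this sketch. */
    glVertexAttribPointer(0, 3, GL_FLOAT,         GL_FALSE, stride,
                          (const void *)offsetof(PackedVertex, position));
    glVertexAttribPointer(1, 4, GL_UNSIGNED_BYTE, GL_TRUE,  stride,
                          (const void *)offsetof(PackedVertex, colour));
    glVertexAttribPointer(2, 2, GL_HALF_FLOAT,    GL_FALSE, stride,
                          (const void *)offsetof(PackedVertex, uv));

    glEnableVertexAttribArray(0);
    glEnableVertexAttribArray(1);
    glEnableVertexAttribArray(2);
}
```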


So you can see that making the most out of the performance available without killing the bandwidth requires slightly different (but not excessively tricky) thinking compared to desktop or console.


Next time... "It’s Just Criminal! Examples of performance thrown away in real Apps"


ARM joined the Global Mobile Game Confederation (GMGC)


The “Global Mobile Game Confederation”, referred to as GMGC, was founded on September 14, 2012 in Beijing by leading figures of the global mobile game industry. It was established to promote exchanges and cooperation in mobile game technology in China and abroad, to share the latest global technological achievements and experience, to explore future mobile technology trends, to enhance the level of R&D management, to apply technological innovation and to cultivate high-end, well-rounded talent for the mobile game industry. Today, GMGC has developed into the largest mobile gaming association in the Asia-Pacific region, with significant global influence. As of the end of 2013, GMGC had more than 120 members from around 13 countries, including China, South Korea, Japan, Denmark, Australia, Germany and India.

GMGC offers "4C" services to all its members. "4C" stands for Confederation, Conference, Developer Camp and Consulting. Through these services, members can engage with each other and explore the collaborations they want.

 


ARM has worked with GMGC since 2012. Not only did ARM attend the 1st and 2nd GMGC (Global Mobile Game Congress) and GMGDC, but GMGC also supported ARM's events, such as the 2013 ARM Ecosystem Summit.


 

Figure 2: ARM at GMGC 2013
Figure 3: David Song (Founder of GMGC) at the ARM Ecosystem Summit 2013

 

Through working with GMGC, ARM has built and strengthened its partnerships with the leading mobile gaming companies in China and made effective connections with the developers who are creating amazing gaming content that runs best on ARM® CPU and Mali™ GPU based devices. In order to better leverage the GMGC platform and build a flourishing gaming ecosystem, ARM joined GMGC in 2014.

For GMGC, its membership becomes more geographically integrated with ARM joining; and for ARM, becoming a member of GMGC means we can better serve our ecosystem partners and become a more constructive member of the gaming industry!

By the way, the 2014 Global Mobile Game Congress (www.gmgc.info) will be held in Beijing from 26th to 27th March. It is a great opportunity to meet old and new friends from the mobile gaming industry and experience the latest gaming technologies! Of course, as a member of GMGC, ARM will bring its latest graphics and gaming demos running on ARM Mali GPUs to GMGC 2014 and exchange insights and knowledge with developers. It will be a big party for mobile games, and you shouldn't miss it!

It’s Just Criminal! Examples of Performance Thrown Away in Real Apps (Part 1)


Every time I stand up at GDC and give a presentation on how to improve the performance of mobile applications there is always one person in the crowd that fills in the feedback form with the following – “Well, this is all obvious. I’ve been doing this for years!” I do wonder if it’s the same guy every time or different people, but, my friend, I’m here to inform you that sadly, you are in a minority.


I have seen the following frequently in my role as Director of the ARM® Mali™ Performance Analysis team and in my mind, ignoring the tips below should be a crime against graphics performance (mostly because I like the image of bursting into a room and shouting “Book ’em Danno, performacide in the 1st!”*). I’ve picked out some of the more recurrent themes from the last few years with the hope of a little crime prevention…the names have been removed to protect the innocent.   

*Being English and a child of the 70’s I did want to put a quote from the Sweeney in, but Jack Regan’s utterance wouldn’t make it past marketing in our postmodern, more sensitive world. 

 

“Overdraw: A Victimless Crime?” or “Sort your objects and nobody gets hurt!”

This is by far the easiest optimization to implement, but the majority of apps we see still don’t use it (including not one, but two widely used graphics benchmarks). It’s an amazing thing, seeing the surprise on the face of the developer when you show him how easy this is to implement and the effect it can have on performance. Apparently qsort() is a very overlooked function in libc.


Simply put, you order the objects by the Z values of their origins and submit them in front-to-back draw order. This ensures optimal performance of the Z buffer. If you want to get fancier you can do it based on the bounding sphere or box for larger objects with potential overlap. “But what if I have objects with Alpha?” You simply separate out the objects containing Alpha and order those by Z too (back to front, as blending requires). Draw the opaque objects first and then draw the Alpha objects.
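As a minimal sketch of how little code the sort needs, here is the qsort() approach in C; the DrawObject structure and its view-space depth field are hypothetical stand-ins for whatever your engine tracks per object.

```c
#include <stdlib.h>

typedef struct {
    float view_z;        /* depth of the object's origin in view space */
    int   has_alpha;     /* non-zero if the object needs blending      */
    /* ... handles to mesh, textures, uniforms, etc. ...               */
} DrawObject;

/* Front-to-back for opaque objects: nearest (smallest view_z) first. */
static int compare_front_to_back(const void *a, const void *b)
{
    const DrawObject *oa = a, *ob = b;
    return (oa->view_z > ob->view_z) - (oa->view_z < ob->view_z);
}

/* Back-to-front for blended objects: furthest first. */
static int compare_back_to_front(const void *a, const void *b)
{
    return compare_front_to_back(b, a);
}

void sort_and_draw(DrawObject *opaque, size_t n_opaque,
                   DrawObject *blended, size_t n_blended)
{
    qsort(opaque,  n_opaque,  sizeof(DrawObject), compare_front_to_back);
    qsort(blended, n_blended, sizeof(DrawObject), compare_back_to_front);
    /* Draw the opaque list first, then the blended list. */
}
```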


From my crime files the worst offender I’ve seen was an app with an average overdraw of 12x (see my previous blog to get an idea of what effect this has). My team showed the developer the version with sorted objects and they ended up with a 2-3x performance boost.

 

“GBH: Grievous Batching Horrors”

There is a common misconception that calls to the driver are free. Unfortunately, this is not true. Perhaps we have only ourselves to blame as we make every effort to make it seem that way, but every OpenGL® ES API function call has CPU overhead. For some functions that overhead is bigger, for others it’s smaller; this is largely dependent on how much state they affect. Naturally, functions such as draw calls (calls to glDraw* functions) tend to carry more overhead because they use the state information.

 

This sounds like a basic concept, but you would be surprised at what is done in apps. Issuing excessive draw calls is generally a bad idea. One notable example was an app which sent a single quad per draw call because each quad used a different part of a texture (no, I’m not kidding), meaning it used 700+ draw calls per frame to draw the scene. On lower-end (single core ARM11xx™ class) platforms, this consumed almost as much CPU time as it took to draw the scene in the first place.

 

Generally, draw calls consume less time on deferred rendering GPUs because the driver only needs to ensure it has a snapshot of the state and buffers etc. A deferred renderer won’t actually engage the hardware to draw anything (usually the point which would cause overhead) until eglSwapBuffers(), glFlush() or a similar condition requires the draw calls to be resolved. This means that a lot of the cost can be offset in today’s multi-core CPU environments by asynchronously performing the data/state preparation on a separate thread which runs in parallel with the app and driver.

 

However, there is still an overhead and this overhead varies little with the number of primitives being drawn by the draw call. The overhead is similar whether you draw a single triangle or thousands of triangles in a draw call. So if you combine multiple triangles into a single draw call the overhead is only applied once rather than multiple times. This reduces the total overhead and increases the performance of your application. For some very neat ideas on how to combine draw calls more effectively, start with this blog.
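For illustration, here is a hedged C sketch of the quad-batching idea using OpenGL ES 2.0 calls: instead of one draw call per atlas-addressed quad, the quads are accumulated and drawn in one go. The QuadVertex layout, the MAX_QUADS limit and the assumption that vertex attribute state is already configured against these buffers are all illustrative.

```c
#include <GLES2/gl2.h>

#define MAX_QUADS 1024

typedef struct { float x, y, u, v; } QuadVertex;

static QuadVertex batch[MAX_QUADS * 4];
static GLushort   batch_indices[MAX_QUADS * 6];
static int        quad_count;

/* Append one quad (four corners with atlas UVs) to the current batch.
 * The caller should flush when quad_count reaches MAX_QUADS. */
void batch_quad(const QuadVertex corners[4])
{
    GLushort base = (GLushort)(quad_count * 4);
    for (int i = 0; i < 4; ++i)
        batch[base + i] = corners[i];

    GLushort *idx = &batch_indices[quad_count * 6];
    idx[0] = base;     idx[1] = base + 1; idx[2] = base + 2;
    idx[3] = base + 2; idx[4] = base + 3; idx[5] = base;
    ++quad_count;
}

/* Upload the whole batch and issue a single draw call for all quads. */
void flush_batch(GLuint vbo, GLuint ibo)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, quad_count * 4 * sizeof(QuadVertex),
                 batch, GL_DYNAMIC_DRAW);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, quad_count * 6 * sizeof(GLushort),
                 batch_indices, GL_DYNAMIC_DRAW);

    glDrawElements(GL_TRIANGLES, quad_count * 6, GL_UNSIGNED_SHORT, 0);
    quad_count = 0;
}
```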

 

“I’d like to plead insanity your honor.”

Having said all of that, don’t go crazy! You need to bear in mind that large object batches with high potential for occlusion (single, very large scenery objects with large portions on and off screen or a number of smaller objects distributed over a large area in the scene are good examples) can be unnecessarily costly as the vertices still need processing to determine position before visibility culling.

 

For very large objects with dense geometry, it is always worth implementing a hierarchy of bounding boxes and checking each child box for visibility rather than sending the whole object and letting the GPU work it out. Again we have seen examples of objects in apps which have vertex counts in the 50K region where only 20-30% of the object is visible at any one time.

 

The bandwidth cost of those vertices and processing time in the GPU versus a simple bounding volume check against the view frustum is likely to be an order of magnitude difference. That’s an order of magnitude for the sake of a bit of judicious app-side culling…
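A minimal sketch of the kind of app-side check this implies is shown below; the plane representation and the assumption that the six frustum planes have already been extracted from the view-projection matrix are illustrative. A bounding-box hierarchy would simply call a test like this for each child volume.

```c
typedef struct { float x, y, z; } Vec3;
typedef struct { Vec3 normal; float d; } Plane;   /* plane: dot(n, p) + d = 0 */

/* Returns non-zero if a bounding sphere is at least partially inside the
 * view frustum. The six planes are assumed to point inwards. */
int sphere_visible(const Plane frustum[6], Vec3 centre, float radius)
{
    for (int i = 0; i < 6; ++i) {
        float dist = frustum[i].normal.x * centre.x +
                     frustum[i].normal.y * centre.y +
                     frustum[i].normal.z * centre.z + frustum[i].d;
        if (dist < -radius)
            return 0;   /* completely outside this plane: cull the object */
    }
    return 1;           /* inside or intersecting: submit the draw call   */
}
```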

 

“Bound over (and over and over…)”

Unfortunately, for some reason this is seen regularly in a lot of commercial engines, and it can cause flushes and reloads of tile/cache memory. The optimal use case is to bind once, issue all draw calls and then unbind.

 

Why? Because most deferred rendering GPUs work on a small section of the screen at a time, commonly referred to as a tile. That tile is an ‘N’x’N’ sub-region of the screen. What the driver and the GPU try to do is retain the tile they are working on for as long as there is work to be done on it. Binding and unbinding between draw calls means the driver has to second-guess what you wanted it to do. If it’s not sure, it has to err on the side of caution and write the tile back to main memory. A re-bind of the same target after an un-bind can see the tile ping-pong into and out of memory.

 

Remember the driver gets very little information about what your intent is (hopefully this will be fixed in future revisions of OpenGL ES, but for now we have to live with it), so making it second guess is always a bad idea. If you leave the draw target bound then you are explicitly telling the driver “yes – I’m still drawing to that”. Also, take a look at the use of glDiscardFramebufferEXT() which helps indicate to the driver when the render attachment is complete.
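Putting the two pieces of advice together, here is a hedged sketch of the bind-once pattern combined with glDiscardFramebufferEXT(); the FBO handle and the draw_scene_to_shadow_map() callback are hypothetical, and the EXT_discard_framebuffer extension should be checked for at runtime before relying on it.

```c
#include <GLES2/gl2.h>
#include <GLES2/gl2ext.h>   /* glDiscardFramebufferEXT, if the header has it */

extern void draw_scene_to_shadow_map(void);   /* hypothetical scene callback */

void render_shadow_map(GLuint fbo)
{
    /* Bind the off-screen target once and issue every draw call for it. */
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);

    draw_scene_to_shadow_map();

#ifdef GL_EXT_discard_framebuffer
    /* The depth and stencil contents are only needed in-tile; telling the
     * driver so means they are never written back to main memory. */
    const GLenum discard[] = { GL_DEPTH_ATTACHMENT, GL_STENCIL_ATTACHMENT };
    glDiscardFramebufferEXT(GL_FRAMEBUFFER, 2, discard);
#endif

    /* Unbind once, at the end, rather than around every draw call. */
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
}
```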

 

Next... I will be exposing more crimes against performance in “It’s Just Criminal! (Part 2)”, in which PC Ray Caster will put before the jury the case of the year.

Introduction to blog series on accelerated computing technologies


Welcome to my series of blogs on accelerated computing technologies!  Special-purpose hardware designed to execute certain computations (prominently, GPUs designed to execute graphics computations) is expected to provide better performance than general-purpose hardware (prominently, CPUs).  Better performance typically means faster or accelerated execution, but often means lower energy consumption as well.  Expectations of better performance of course imply that software is also up to scratch.

 

By way of introduction, I have been working on accelerated computing technologies for over ten years, first with CPU vector extensions like ARM NEON, then with vector co-processors like ClearSpeed CSX and Cell SPE, and more recently with GPUs supporting parallel computations like ARM Mali. I have experienced switching from just using vendor-specific APIs to both implementing and using vendor-independent standards such as OpenCL. I have also worked in both academia and industry, which is bound to affect what I am going to write about.

 

I am aiming at engineering-minded people out there, so you should expect facts and informed opinions – no hype, no politics.


I am telling you there is a better way of writing software for accelerated systems.  Stay tuned!

ARM and the ARM® Mali™ Ecosystem are ready to rock at GDC!


The annual Game Developers Conference (GDC) is fast approaching. The ARM team is in the midst of finalizing preparations to try and make this GDC our best one yet – it will definitely be our biggest! We have some great activities planned for the week and will be joined by a host of ARM partners at both the ARM booth (#1616) and the speaking sessions in West Hall Room 3014, many of whom have announcements which we can’t wait to share!

 


Joining us on the ARM Booth...

There is a wealth of variety in the tools, game engines and middleware available to developers who are looking to expand into the mobile gaming market. These can help ease the effort of development, improve the performance of your app or even add differentiating features. GDC is a great opportunity for developers to discover these and at the ARM booth this year we will be joined by an array of Ecosystem partners who offer such solutions, all of which have been optimized for ARM-based devices. We are especially excited to be joined by several Chinese partners (Tencent, Sohu, Perfect World and Chukong/Cocos2d-x) who will display their latest gaming technology and demonstrate the huge opportunities that are emerging in the Chinese mobile gaming market. Between these partners there are some exciting developments which are worth checking out: Sohu has recently released an open source version of their Genesis-3d engine; Chukong/Cocos2d-x have launched their new Cocos3d-x engine; the most popular Chinese 3D mobile games will be demonstrated by Tencent, and Perfect World will be there with their ECHO-es engine which powers their famous 3D adventure and fantasy MMORPG.

 

Also on the ARM booth will be PlayCanvas and Goo Technologies, who are showcasing their HTML5 solutions, optimized for mobile, which can enable you to make rich and engaging browser-based games (which could be fantastic for the DTV market, for example). The Geomerics team, whose Enlighten™ technology has already been revolutionizing the console gaming experience and is set to rapidly accelerate the transition to photo-realistic graphics in mobile, will also be on hand to explain to developers how you can get access to their solution. The Marmalade staff will detail the benefits of their SDK, which allows developers to code once and distribute cross-platform across any mobile device, as well as Marmalade Quick, Web Marmalade and Marmalade Juice, which all build on top of the SDK, providing alternatives for developers using Lua, HTML or Objective-C® software. You can also learn how to boost your code's performance at Umbra's pod, where they will show their smart culling technology. The Havok™ Anarchy™ engine will be on a pod, and at Simplygon's pod developers can discover how to optimize their content for ARM-based mobile platforms to deliver the AAA game you aimed for. Finally, the Samsung Developers pod will host the Samsung Chord SDK, which is designed to help developers who have no knowledge of networking develop local information-sharing applications and apply them to games, like Asphalt 8 by Gameloft.

 

 


Screenshot from Project Anarchy


Find out more about ARM technology

 

ARM will be presenting five times at GDC (all in West Hall Room 3014!), covering key subjects of the moment for mobile gamers. We will discuss ARM’s recent game middleware technology acquisition (the Enlighten solution by Geomerics) and how this solution is developing, along with how to efficiently use and develop for a deferred rendering architecture (the architecture mainly used for battery constrained mobile devices). Gameloft will join us for another talk to go through the different methods which you could use to optimize AAA games for mobile. The next talk will cover OpenGL® ES 3.0 and beyond, with a good sideline on ASTC Full Profile and how its flexibility and quality can make the world of difference to your games. Covered in our fourth talk will be how you can best maximize the performance of the ARM Cortex® CPU, the IP within over 95% of mobile devices, by using ARM NEON™ technology and/or multithreading. Finally our last (but not least) talk of the show covers the proven tools ARM offers to game developers and, together with our partner, EA, we will show attendees how best to use these tools to create fantastic, high-performing games.

The full talk abstracts and schedule can be found here.

 

Discuss the latest techniques with industry experts

 

For those who want to learn a little more about the solutions on the ARM booth or who can’t make our talks above, ARM engineers and our partners are going to be presenting at the ARM Lecture Theatre within our booth. They will be joined by other ARM partners such as Testin and their cloud testing platform, Autodesk live-drawing professional quality digital art using their SketchBook® Pro software optimized for Android devices and Guild Software with their renowned Vendetta Online. The schedule is packed with items of interest to mobile developers, so please check the schedule and topics here to discover what might interest you.

In the link you will also find several other talks from the ARM Developer Education team, covering best practices for developing games for mobile, optimizing your WebGL™ application for mobile, using the ARM tools for game development, and a preview of the content of the five ARM GDC sponsored sessions.

 


 

Show off your skills in the ARM Game Challenge!

 

Game developers can chill out in ARM’s Gaming Zone, where hit apps which have been optimized to deliver great visual effects using the ARM tools will be available – these include such titles as VW Sports Car Challenge, Real Racing 3, and Asphalt 8. If you’re feeling lucky, take part in the Sports Car Challenge 2 for your chance to win a Samsung Galaxy Note 10.1 2014 Edition – prove yourself to be the fastest gamer of the show by getting the fastest lap time and claim your prize at the end of the day!

Discover a new way to create digital art in the ARM Artist Competition

 

ARM, together with Autodesk and Samsung Developers, will be highlighting the digital art capabilities of the Samsung Galaxy tablets. The Samsung Galaxy NotePRO 12.2 is a well-recognized tool for developers which enables them to design on the go, freeing themselves from their desk – and the SketchBook Pro software by Autodesk is great for this task too. Test the devices on the ARM booth to discover their potential - and if you draw the best picture of the day you will be in with a chance of winning a Samsung Galaxy Note 8!


For more information on the Artist Competition, check out Akshay’s blog, coming here soon.


For information on the ARM demos on show at GDC, check out Phill Smith’s blog Expanding the Mobile Gaming Experience at GDC 2014


Meanwhile, I need to get back to my GDC preparation – I look forward to meeting everyone next week at GDC, please drop by the ARM booth and say hello!


The Mali GPU: An Abstract Machine, Part 3 - The Shader Core


In the first two blogs of this series I introduced the frame-level pipelining [The Mali GPU: An Abstract Machine, Part 1 - Frame Pipelining] and tile based rendering architecture [The Mali GPU: An Abstract Machine, Part 2 - Tile-based Rendering] used by the Mali GPUs, aiming to develop a mental model which developers can use to explain the behavior of the graphics stack when optimizing the performance of their applications.

 

In this blog I will finish the construction of this abstract machine, forming the final component: the Mali GPU itself.  This blog assumes you have read the first two parts in the series, so I would recommend starting with those if you have not read them already.

 

GPU Architecture

 

The "Midgard" family of Mali GPUs  (the Mali-T600 and Mali-T700 series) use a unified shader core architecture, meaning that only a single type of shader core exists in the design. This single core can execute all types of programmable shader code, including vertex shaders, fragment shaders, and compute kernels.

 

The exact number of shader cores present in a particular silicon chip varies; our silicon partners can choose how many shader cores they implement based on their performance needs and silicon area constraints. The Mali-T760 GPU can scale from a single core for low-end devices all the way up to 16 cores for the highest performance designs, but between 4 and 8 cores are the most common implementations.

 

[Figure: Mali GPU top-level architecture]

The graphics work for the GPU is queued in a pair of queues, one for vertex/tiling workloads and one for fragment workloads, with all work for one render target being submitted as a single submission into each queue. Workloads from both queues can be processed by the GPU at the same time, so vertex processing and fragment processing for different render targets can be running in parallel (see the first blog for more details on this pipelining methodology). The workload for a single render target is broken into smaller pieces and distributed across all of the shader cores in the GPU, or in the case of tiling workloads (see the second blog in this series for an overview of tiling) a fixed function tiling unit.

 

The shader cores in the system share a level 2 cache to improve performance, and to reduce memory bandwidth caused by repeated data fetches. Like the number of cores, the size of the L2 is configurable by our silicon partners, but is typically in the range of 32-64KB per shader core in the GPU depending on how much silicon area is available. The number and bus width of the memory ports this cache has to external memory is configurable, again allowing our partners to tune the implementation to meet their performance, power, and area needs. In general we aim to be able to write one 32-bit pixel per core per clock, so it would be reasonable to expect an 8-core design to have a total of 256-bits of memory bandwidth (for both read and write) per clock cycle.

 

Mali GPU Shader Core

 

The Mali shader core is structured as a number of fixed-function hardware blocks wrapped around a programmable "tripipe" execution core. The fixed function units perform the setup for a shader operation - such as rasterizing triangles or performing depth testing - or handling the post-shader activities - such as blending, or writing back a whole tile's worth of data at the end of rendering. The tripipe itself is the programmable part responsible for the execution of shader programs.

 

[Figure: Mali shader core structure]

 

The Tripipe

 

There are three classes of execution pipeline in the tripipe design: one handling arithmetic operations, one handling memory load/store and varying access, and one handling texture access. There is one load/store and one texture pipe per shader core, but the number of arithmetic pipelines can vary depending on which GPU you are using; most silicon shipping today will have two arithmetic pipelines, but GPU variants with up to four pipelines are also available.

 

Massively Multi-threaded Machine

 

Unlike a traditional CPU architecture, where you will typically only have a single thread of execution at a time on a single core, the tripipe is a massively multi-threaded processing engine. There may well be hundreds of hardware threads running at the same time in the tripipe, with one thread created for each vertex or fragment which is shaded. This large number of threads exists to hide memory latency; it doesn't matter if some threads are stalled waiting for memory, as long as at least one thread is available to execute then we maintain efficient execution.

 

Arithmetic Pipeline: Vector Core

 

The arithmetic pipeline (A-pipe) is a SIMD (single instruction multiple data) vector processing engine, with arithmetic units which operate on 128-bit quad-word registers. The registers can be flexibly accessed as either 2 x FP64, 4 x FP32, 8 x FP16, 2 x int64, 4 x int32, 8 x int16, or 16 x int8. It is therefore possible for a single arithmetic vector task to operate on 8 "mediump" values in a single operation, and for OpenCL kernels operating on 8-bit luminance data to process 16 pixels per SIMD unit per clock cycle.

 

While I can't disclose the internal architecture of the arithmetic pipeline, our public performance data for each GPU can be used to give some idea of the number of maths units available. For example, the Mali-T760 with 16 cores is rated at 326 FP32 GFLOPS at 600MHz. This gives a total of 34 FP32 FLOPS per clock cycle for this shader core; it has two pipelines, so that's 17 FP32 FLOPS per pipeline per clock cycle. The available performance in terms of operations will increase for FP16/int16/int8 and decrease for FP64/int64 data types.

 

Texture Pipeline

 

The texture pipeline (T-pipe) is responsible for all memory access to do with textures. The texture pipeline can return one bilinear filtered texel per clock; trilinear filtering requires us to load samples from two different mipmaps in memory, so requires a second clock cycle to complete.

 

Load/Store Pipeline

 

The load/store pipeline (LS-pipe) is responsible for all memory accesses which are not related to texturing.  For graphics workloads this means reading attributes and writing varyings during vertex shading, and reading varyings during fragment shading. In general every instruction is a single memory access operation, although like the arithmetic pipeline they are vector operations and so could load an entire "highp" vec4 varying in a single instruction.

 

Early ZS Testing and Late ZS Testing

 

In the OpenGL ES specification "fragment operations" - which include depth and stencil testing - happen at the end of the pipeline, after fragment shading has completed. This makes the specification very simple, but implies that you have to spend lots of time shading something, only to throw it away at the end of the frame if it turns out to be killed by ZS testing. Coloring fragments just to discard them would cost a huge amount of performance and waste energy, so where possible we will do ZS testing early (i.e. before fragment shading), only falling back to late ZS testing (i.e. after fragment shading) where it is unavoidable (e.g. a dependency on a fragment which may call "discard" and as such has indeterminate depth state until it exits the tripipe).

 

In addition to the traditional early-z schemes, we also have some overdraw removal capability which can stop fragments which have already been rasterized from turning into real rendering work if they do not contribute to the output scene in a useful way. My colleague Sean Ellis has a great blog looking at this technology - Killing Pixels - A New Optimization for Shading on ARM Mali GPUs - so I won't dive into any more detail here.

 

GPU Limits

 

Based on this simple model it is possible to outline some of the fundamental properties underpinning the GPU performance.

 

  • The GPU can issue one vertex per shader core per clock
  • The GPU can issue one fragment per shader core per clock
  • The GPU can retire one pixel per shader core per clock
  • We can issue one instruction per pipe per clock, so for a typical shader core we can issue four instructions in parallel if we have them available to run
    • We can achieve 17 FP32 operations per A-pipe
    • One vector load, one vector store, or one vector varying per LS-pipe
    • One bilinear filtered texel per T-pipe
  • The GPU will typically have 32-bits of DDR access (read and write) per core per clock [configurable]

 

If we scale this to a Mali-T760 MP8 running at 600MHz we can calculate the theoretical peak performance as:

 

  • Fillrate:
    • 8 pixels per clock = 4.8 GPix/s
    • That's 2314 complete 1080p frames per second!
  • Texture rate:
    • 8 bilinear texels per clock = 4.8 GTex/s
    • That's 38 bilinear filtered texture lookups per pixel for 1080p @ 60 FPS!
  • Arithmetic rate:
    • 17 FP32 FLOPS per pipe per core = 163 FP32 GFLOPS
    • That's 1311 FLOPS per pixel for 1080p @ 60 FPS!
  • Bandwidth:
    • 256-bits of memory access per clock = 19.2GB/s read and write bandwidth¹.
    • That's 154 bytes per pixel for 1080p @ 60 FPS!
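If you want to re-derive those figures for a different configuration, the arithmetic is simple enough to capture in a few lines of C. This is a sketch based only on the per-clock limits listed above; expect small rounding differences from the numbers quoted.

```c
#include <stdio.h>

int main(void)
{
    const double cores          = 8;          /* Mali-T760 MP8              */
    const double clock_hz       = 600e6;      /* 600 MHz                    */
    const double flops_per_core = 17 * 2;     /* 17 FP32 FLOPS x 2 A-pipes  */
    const double pixels_1080p   = 1920.0 * 1080.0;
    const double fps            = 60.0;

    double fill   = cores * clock_hz;              /* pixels per second      */
    double texels = cores * clock_hz;              /* bilinear texels/second */
    double gflops = cores * flops_per_core * clock_hz / 1e9;
    double bw     = cores * 4.0 * clock_hz;        /* 32 bits per core/clock */

    printf("Fillrate   : %.1f GPix/s (%.0f 1080p frames/s)\n",
           fill / 1e9, fill / pixels_1080p);
    printf("Texturing  : %.1f GTex/s (%.1f lookups/pixel @ 1080p60)\n",
           texels / 1e9, texels / (pixels_1080p * fps));
    printf("Arithmetic : %.0f FP32 GFLOPS (%.0f FLOPS/pixel @ 1080p60)\n",
           gflops, gflops * 1e9 / (pixels_1080p * fps));
    printf("Bandwidth  : %.1f GB/s (%.0f bytes/pixel @ 1080p60)\n",
           bw / 1e9, bw / (pixels_1080p * fps));
    return 0;
}
```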

 

OpenCL and Compute

 

The observant reader will have noted that I've talked a lot about vertices and fragments - the staple of graphics work - but have mentioned very little about how OpenCL and RenderScript compute threads come into being inside the core. Both of these types of work behave almost identically to vertex threads - you can view running a vertex shader over an array of vertices as a 1-dimensional compute problem. So the vertex thread creator also spawns compute threads, although more accurately I would say the compute thread creator also spawns vertices.

 

Next Time ...

 

This blog concludes the first chapter of this series, developing the abstract machine which defines the basic behaviors which an application developer should expect to see for a Mali GPU in the Midgard family. Over the rest of this series I'll start to put this new knowledge to work, investigating some common application development pitfalls, and useful optimization techniques, which can be identified and debugged using the Mali integration into the ARM DS-5 Streamline profiling tools.

 

Comments and questions welcomed as always,

TTFN,

Pete

 

Footnotes

 

  1. ... 19.2GB/s subject to the ability of the rest of the memory system outside of the GPU to give us data this quickly. Like most features of an ARM-based chip, the down-stream memory system is highly configurable in order to allow different vendors to tune power, performance, and silicon area according to their needs. For most SoC parts the rest of the system will throttle the available bandwidth before the GPU runs out of an ability to request data. It is unlikely you would want to sustain this kind of bandwidth for prolonged periods, but short burst performance is important.

 


Pete Harris is the lead performance engineer for the Mali OpenGL ES driver team at ARM. He enjoys spending his time working on a whiteboard and determining how to get the best out of combined hardware and software compute sub-systems. He spends his working days thinking about how to make the ARM Mali drivers even better.

ARM’s experiences with the drawElements Quality Program


Here at ARM we continuously work to increase the quality of our driver software. Thorough testing is a significant proportion of our driver production process and its efficiency enables us to deliver drivers that meet and exceed our partners’ quality expectations  sooner than we would otherwise be able  to. You might have seen the announcement made in fall 2013: ARM Ltd., the leading semiconductor intellectual property (IP) supplier, expands its quality assurance processes with the adoption of the OpenCL™ and OpenGL® ES 3.0 test modules of the drawElements Quality Program – the dEQP™. This subscription was all part of ensuring that as the graphics industry evolves, the drivers that we deliver continue to be of the highest standard possible.


Based on our experience with the ARM® Mali™-400 GPU series we certainly had confidence in our test coverage for OpenGL ES 2.0 which we built up over multiple releases. Despite the fact that the ARM Mali-T600 GPU series is a radically new design comprising a unified shader architecture, the pool of test cases targeting the API  version 2.0  was easy to re-use right from the start. But for OpenGL ES 3.0, being a new API, there was barely anything out there - real world OpenGL ES 3.0 content was still to come. We based our initial testing on the conformance test package from Khronos and, to a much larger extent, on in-house testing of the new features. However, we wanted to take the quality of the driver higher than these two processes allow in order to exterminate any stubborn bugs. To do this, an external testing suite was in order.  Why? Well, it’s good that you asked.

 

For brand new features our in-house testing is typically what you might refer to as "white box" testing. Engineers familiar with the driver’s details develop targeted tests against new features, based on the OpenGL ES 3.0 specification from Khronos. If you want to factor in the inflow of new colleagues one might be willing to shift it into the "gray"-zone, but certainly the tests are not of the "black box" kind. Whereas such internal driver knowledge makes it possible to write tests targeting even very specific driver behaviour it ends up creating a one-sided view of driver performance. Engineers just "know more than they should" for developing black-box tests. Yet such black-box tests are vital to perform because the ultimate end-user, our partner, will not have the same intricate knowledge of ARM software as our engineers and so their actions and experience will be quite different. 

 

Still, one might raise the question “Your driver passed the conformance tests - what else is left to test then?” There's a short summary written up here describing how one obtains confidence on a per-feature basis from the conformance test package. But ARM is more interested in the combination of features - which is what one typically uses in real world applications – and this has less coverage. So even though we passed conformance, if we did not perform additional in-house testing there could be a higher number of bugs going out and impacting our partners and our only method for finding and fixing them would be partner feedback. Hardly an ideal situation.


So, what were our expectations when licensing an external test suite, adding more than 30,000 new test cases to our in-house testing? Pass all of them? That would have been really cool and deserved plenty of cake (our replacement for beer here in Norway). The reality was that, when running the latest Mali-T600 GPU driver on an Exynos 5250 based platform running Linux with dEQP version 2013.4.1, we happily passed 98.5% of the OpenGL ES 3.0 functional test group and an even larger part for OpenCL 1.1, although we did not pass all of them – which, at the very least, proved to us the value in the drawElements’ testing suite.


If your testing group tells you that there are roughly a hundred new cases waiting for further investigation, your first response certainly is not "Yeah - great!". But thinking a bit more broadly, maybe it should have been.  Getting over a hundred failing test cases "in" all of a sudden certainly has an impact on daily work and schedules. But that's what we bought them for - to prevent partners and developers from discovering these issues over time. It's better to see all potential issues in one go than waiting for them to trickle in across one or two years from the partner or even the developer community.  Within ARM’s release schedule which is, due to the target market, quite different from what you might be used to from your desktop GPU vendor, there is no room for a "quick-fix" once a driver is out. So everything we find and fix in our products upfront is very important to ARM and our partners.


dEQP provides challenging test cases for a variety of areas. The ones most interesting to us are:


"Exhaustive positive and negative testing of individual API features"

The situation of positive testing is quite clear: if our driver did not allow something the specification requires, we have a severe bug. Luckily for us we passed that hurdle well.

On negative testing the situation is a bit different: In the case that our driver allows things it should not, is this really a problem? Isn't that perhaps more of a feature, given it works in a sane way? Actually, it is a problem as it causes fragmentation on the market and leads to the unfortunate situations of "But this works with Vendor A!".  Those issues will hit developers when they start to migrate from a single development platform into the wild world to test their apps. If "outside the spec" behaviour is considered to be valuable it can always be properly captured in an extension specification.

Similarly, negative testing involves testing error handling by executing API calls which are not even supposed to work due to, for example, using the wrong parameters. Even though it is not standard practice to  base  application behaviour on the specific error code returned, we recognize the importance of correct error codes to assist debugging problems during development (you might want to read further about the debugging extension which eases the situation a bit). Long story short – with the help of dEQP we greatly improved our ability to return the right error codes.

 

“Stress tests”

There is one error an application should always be aware of: The famous GL_OUT_OF_MEMORY. This one is raised whenever there are no system resources left to successfully complete an API call. One scarce resource is the amount of available (and free to use) memory. The drawElements’ test suite covers that part by forcefully driving the system into a low memory state to check how stable the driver can handle the situation.

As we saw during testing, this is a difficult situation to operate in. The Android™ OS, for example, has a low-memory process killer triggering at a higher threshold than the one on plain Linux, sometimes not even leaving time for the application to properly close before it is killed by the system underneath.  Passing these tests on every platform is a challenge, but a challenge that we are overcoming more rapidly with the help of drawElements’ testing suite.


"Precision tests"

Due to the way precision is specified by OpenGL ES 3.0, testing for it is a challenge. Rounding behaviour and INF/NAN handling are implementation defined and only the minimum precision to maintain is specified. We realize it is challenging to come up with stable test cases as soon as they touch any of these "implementation defined" areas. And a few tests do touch on these areas. So when it comes to answering the question of whether unexpected (failing) test results are still valid results within the constraints of the specification, we spent quite some time verifying that our driver, the GPU compiler backend and finally the GPU all treat 'mediump' and 'lowp' precision qualifiers as mandated by the specification. In the end, the effort between us and drawElements was well spent on those cases. For example, 'mediump' is a great tool for saving memory and bandwidth and for reducing cache pressure wherever possible. But bear in mind that it is up to the developer to ensure that calculations stay within those minimum precision limits. For more details I refer you to Tom Olson's detailed series.


"Cut-down real-world use cases like shadow algorithms, post-processing chains for SSAO, complex transform feedback chains"

These test cases are the most difficult ones to investigate due to their complexity. We take them one by one, and as we build up confidence in the earlier areas we get better and better at pointing out which assumptions might be wrong in the complex test cases. Sometimes we might even consider test cases as "overly complex/complicated to achieve a specific goal", but the question of "why should one do this" is no excuse if they don't work on our driver.

 

So far ARM has closely investigated around 130 test failures reported by drawElements’ test suite for OpenGL ES 3.0 which were neither covered by ARM’s existing test set nor by the Khronos conformance test suite. Compare that number to the number of passed tests, which is over 35,000! Roughly half of these failures were real bugs in our drivers, whereas the other half we found to be targeting behaviour outside of the specification. And what happened with those tests we found to be in conflict with the specification? Well, there are excellent engineers working at drawElements who take feedback seriously and certainly won't accept it blindly. A brief e-mail exchange was usually enough to decide whether the fixes were needed in the test case or the implementation. If a case is really ambiguous and not easily decidable based on the current specification, we can raise the problem together within the Khronos Group to find an agreement amongst all participating members.


Last but not least – such a big test suite is also valuable for automated regression testing. Whenever you pass, you must remember it was just a snapshot of your code base which was okay. New features, optimizations, refactoring, bug fixes – all carry a risk of breaking something unintentionally. With dEQP now part of our regression testing safety net, ARM’s confidence in the quality of our releases is even stronger.

It’s Just Criminal! Examples of Performance Thrown Away in Real Apps (Part 2)


Continuing on from “It’s Just Criminal! Examples of Performance Thrown Away in Real Apps (Part 1)”, let’s look at more criminal behavior and, for me, what has to be the crime of the year...


“Possession is nine tenths of the law.” Or, “The Untouchables.”

Client side buffers really shouldn’t be your first choice on a mobile platform. Any buffer administered by the client has an unknown history. The driver has no idea what has been done between uses unless it scans it. This can be an extremely costly affair so is mostly not an approach that driver implementers take, preferring to recommend Vertex Buffer Objects (VBOs) instead. Because the status of a client side buffer is not deterministic in a deferred rendering GPU (which is, as previously discussed, about 90% of the mobile market) the driver will have to take a copy of the client side buffer being referenced. This has a bandwidth overhead and also a memory footprint overhead.

 

VBOs on the other hand have a prescribed interface in the API so it is possible (to a certain extent) to track the provenance of updates to the buffer meaning the driver will only take a copy when it needs to and can often “dirty patch” the updates so it only requires the difference between the pre and post modified buffer. This can save a lot of potential bandwidth.


One of the biggest offences we’ve seen in this category is using VBOs, but uploading the contents of a client side buffer into the VBO for each draw call. This rather defeats the object. Similarly, overusing dynamic VBO or index buffer object updates using glBufferSubData() etc. causes the same issue. We’ve seen a couple of applications recently which tweak several small sections (in the order of 10-15 vertices each), which are not localized, within a larger VBO on each new draw call. This is not as bad as client side buffers, but if the object is that dynamic it really should be in its own draw call and VBO.
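A hedged sketch of a pattern that avoids both offences follows: static data goes into its own VBO once at load time, while genuinely dynamic data lives in a separate buffer. One common approach for the dynamic buffer is to re-specify its whole store in one go rather than patching it with many scattered glBufferSubData() calls; function names here are illustrative.

```c
#include <GLES2/gl2.h>

/* Static geometry: upload once at load time and never touch it again. */
GLuint create_static_vbo(const void *vertices, GLsizeiptr size)
{
    GLuint vbo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, size, vertices, GL_STATIC_DRAW);
    return vbo;
}

/* Genuinely dynamic geometry lives in its own buffer. Re-specifying the
 * whole store each update lets the driver hand out fresh storage instead
 * of patching (or waiting on) data the GPU may still be reading from
 * earlier, still in-flight frames. */
void update_dynamic_vbo(GLuint vbo, const void *vertices, GLsizeiptr size)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, size, vertices, GL_DYNAMIC_DRAW);
}
```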


See this blog for more details of vertex buffer objects in action.


You also need to pay similar attention to the use of glTexSubImage() updates. Remember: in a deferred renderer no draw activity happens until eglSwapBuffers() or similar is called. If you update a texture several times within a frame that means that all possible forms of that texture must exist at the time of rendering. Overuse of partial texture updates can have a detrimental effect on bandwidth and working footprint.


“Up on multiple counts.”

Multiple Render Targets (MRTs), or the ability to target more than one output buffer with a single draw call, is a common way of reducing the need to send expensive geometry multiple times to build up secondary buffers and is often used in app-side deferred render flows (not to be confused with the deferred hardware model). Since this technique is new for OpenGL® ES 3.0 I’ll apply some leniency, but in the applications we have seen so far there has still been some suspicious behavior!

 

MRTs can be implemented very efficiently on deferred rendering GPUs, if you can keep everything in the tile. Guidance from most of the GPU vendors with deferred architectures (i.e. tile based) is to make sure that the sum of bits per fragment fits within the maximum width of the on-tile storage – bear in mind that each GPU provider will have different criteria, but the consensus seems to be that 128 bits is a good number to work to. Also keep an eye on the alignment of fields for each target (it’s unlikely hardware will allow you to do arbitrary bit field assignments).


As I said there are limited numbers of OpenGL ES 3.0 applications available today, but we have seen at least a couple which use four MRTs (the primary colour buffer and three secondary buffers). In OpenGL® & OpenGL ES all the targets in a MRT need to be the same size and format as the primary. For this use case we had 4xRGBA buffers, which is fine, but when we examined the buffers only 1-2 channels from each target were being used. “So what?” you may say, “It’s all in the tile so I use a little more, big deal”, but at some point you will want to write those buffers back to memory and read them back when you perform your resolve/consolidation pass. It’s going to be a lot cheaper to pack them into two MRTs at full resolution than have to write and read back four.
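As an illustration of the packing advice, here is a hedged OpenGL ES 3.0 sketch that attaches two full-rate RGBA8 targets (64 bits of colour per fragment in total) instead of four sparsely used ones. Texture handles and sizes are placeholders, and the actual packing of G-buffer terms into the channels happens in your shaders.

```c
#include <GLES3/gl3.h>

void attach_gbuffer_targets(GLuint fbo, const GLuint packed_tex[2],
                            GLsizei w, GLsizei h)
{
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);

    for (int i = 0; i < 2; ++i) {
        glBindTexture(GL_TEXTURE_2D, packed_tex[i]);
        glTexStorage2D(GL_TEXTURE_2D, 1, GL_RGBA8, w, h);
        glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0 + i,
                               GL_TEXTURE_2D, packed_tex[i], 0);
    }

    static const GLenum bufs[2] = { GL_COLOR_ATTACHMENT0,
                                    GL_COLOR_ATTACHMENT1 };
    glDrawBuffers(2, bufs);

    /* A real implementation would also attach a depth buffer and check
     * glCheckFramebufferStatus() here. */
}
```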

  

If you want the optimal implementation of the deferred rendering model and you don’t mind using an extension, you might want to take a look at this paper from Sam Martin of Geomerics. By using the extension described, for most cases you can eliminate the need to write back the tile and then read it back as a texture source for the resolve/consolidation pass, saving even more bandwidth.


“Obstructing the GPU in the course of its duties”

Deferred GPUs pipeline the operations required to create a frame. This means that frames move through stages which build a command stream, perform vertex shading and finally perform fragment shading and output. In other words, there are actually three frames in flight, and the one you are working on app-side is actually frame N+2. Within this pipeline, commands such as glReadPixels(), glCopyTexImage() and occlusion queries can block the pipeline and degrade performance if not used carefully… and unfortunately pretty much every app I’ve seen using these mechanisms has committed this crime.


Firstly, if using the command glReadPixels() make sure you use it with pixel buffer objects (PBOs). This schedules the actual pixel read back from the buffer asynchronously (often hardware accelerated) and the glReadPixels command returns to the calling thread immediately without stalling the application. To read the content of the buffer you need to bind and map the PBO (see glMapBuffer()). If rendering to the buffer isn’t complete at the point at which you attempt the map operation, the map will still stall until rendering is complete. Therefore the best advice is to pipeline these read backs where possible such that you are using the results from frame N in frame N+2 or, if that’s not possible, to separate the dependent operations as much as possible and then use fence and sync to ensure coherence. You might consider using a shared context and placing the wait for read back on an asynchronous thread. I’d also apply the same advice to glCopyTexImage().
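Here is a hedged OpenGL ES 3.0 sketch of that pipelined read-back pattern: a small ring of PBOs so that the result mapped in frame N was requested two frames earlier. The buffer count, format and the assumption that the caller unmaps and unbinds after use are illustrative.

```c
#include <GLES3/gl3.h>
#include <stddef.h>

enum { NUM_PBOS = 3 };            /* results are consumed two frames later */
static GLuint pbo[NUM_PBOS];

void init_readback(GLsizei w, GLsizei h)
{
    glGenBuffers(NUM_PBOS, pbo);
    for (int i = 0; i < NUM_PBOS; ++i) {
        glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[i]);
        glBufferData(GL_PIXEL_PACK_BUFFER, w * h * 4, NULL, GL_DYNAMIC_READ);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}

/* Kick off the copy into this frame's PBO; returns immediately. */
void start_readback(int frame, GLsizei w, GLsizei h)
{
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[frame % NUM_PBOS]);
    glReadPixels(0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, 0);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}

/* Map the PBO that was filled two frames ago; by now the GPU should have
 * finished with it, so the map should not stall. The caller must call
 * glUnmapBuffer() and unbind when done with the returned pointer. */
const void *fetch_readback(int frame, GLsizei w, GLsizei h)
{
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[(frame + 1) % NUM_PBOS]);
    return glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, w * h * 4,
                            GL_MAP_READ_BIT);
}
```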


The advice for Occlusion Queries is very similar. Polling for the result of an occlusion query immediately creates a stall (this is true on all GPUs, not just deferred). Therefore the advice is to always pipeline occlusion queries.
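The same pipelining idea for occlusion queries, sketched with the OpenGL ES 3.0 query API: checking GL_QUERY_RESULT_AVAILABLE before asking for the result is what keeps the poll from stalling. The function names are illustrative, and the query objects are assumed to be generated up front and reused.

```c
#include <GLES3/gl3.h>

/* Issue the query around the proxy/bounding geometry draw for this frame. */
void begin_visibility_test(GLuint query)
{
    glBeginQuery(GL_ANY_SAMPLES_PASSED, query);
    /* ... draw the object's bounding volume / proxy geometry here ... */
    glEndQuery(GL_ANY_SAMPLES_PASSED);
}

/* Poll a query issued in an earlier frame. Returns 1 if any samples passed,
 * 0 if the object was fully occluded, and -1 if the result is not ready yet
 * (keep using the previous frame's answer instead of stalling). */
int poll_visibility_test(GLuint query)
{
    GLuint available = 0, samples_passed = 0;

    glGetQueryObjectuiv(query, GL_QUERY_RESULT_AVAILABLE, &available);
    if (!available)
        return -1;

    glGetQueryObjectuiv(query, GL_QUERY_RESULT, &samples_passed);
    return samples_passed ? 1 : 0;
}
```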

 

“Putting the squeeze on the perp.’”

Not compressing your textures is a bit like speeding. We’ve pretty much all done it, it’s easily done, we don’t think about the consequences and everyone has an excuse, but there isn’t one. However, unlike speeding I think that not compressing textures should be a capital offense. Compressing your textures has a massive impact on bandwidth, reducing it 2x, 4x, 8x or more, and is an essential part of authoring for maximum performance on mobile devices.

 

So what’s the excuse? Back in the days of ETC1, there was the defense of “but it doesn’t do Alpha m’ lud”, that, however, could be worked around (see this Mali Developer article). With the introduction of OpenGL ES 3.0 that defense has been eliminated by the inclusion of ETC2 which now has Alpha support. However this has given rise to the “Matrix Defense”; let me explain…

 

Consider the “Matrix” below which shows the available compression formats in the world that developers have been used to. Only a very narrow selection of input bit rates, pixel formats and encoding bit rates can be compressed. The defence is that in the “Matrix”, developers can’t get the exact format they want…   

[Figure: texture compression format matrix without ASTC]

Time to take the red pill. With ASTC this is the new reality:


[Figure: texture compression format matrix with ASTC]

Adaptive Scalable Texture Compression, the standard developed by ARM and officially adopted by The Khronos Group as an extension to both the OpenGL and OpenGL ES graphics APIs, is the best method available, offering increased quality and fidelity, very low bit-rates and just about every input format you may want or need. Independent testing of ASTC has shown that quality levels similar to 2bits per pixel in existing compression schemes can be achieved using the next level down in ASTC, saving further bandwidth for the same level of quality. So now there is no excuse!
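For completeness, uploading a pre-compressed ASTC texture is just a glCompressedTexImage2D() call with the appropriate block-size token from the KHR_texture_compression_astc_ldr extension. This sketch assumes an 8x8 block size (2 bits per pixel), that the extension has been checked for at runtime, and that the payload comes from an offline encoder; the function name and parameters are illustrative.

```c
#include <GLES3/gl3.h>

/* Token from KHR_texture_compression_astc_ldr; the value is fixed by the
 * extension spec but may be missing from older headers. */
#ifndef GL_COMPRESSED_RGBA_ASTC_8x8_KHR
#define GL_COMPRESSED_RGBA_ASTC_8x8_KHR 0x93B7
#endif

void upload_astc_texture(GLsizei width, GLsizei height,
                         const void *astc_data, GLsizei data_size)
{
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);

    /* The data is already compressed offline; the driver just stores it. */
    glCompressedTexImage2D(GL_TEXTURE_2D, 0,
                           GL_COMPRESSED_RGBA_ASTC_8x8_KHR,
                           width, height, 0, data_size, astc_data);

    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
}
```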

 

“(Not so) Supergrass!”

To close out this blog, I’d like to give you my personal pick of crimes against performance from 2013. We join the court as the prosecution presents its first witness…


PC Ray Caster:


“If I may present the evidence, your honour….


"Whilst proceeding in our usual routine activities we happened upon some suspicious activity. The performance analysis team regularly, as new games or graphics focused applications appear in the app stores, run them and investigate how they are using the GPU. This helps us maintain a view of trends in workload, application composition etc. which helps us shape the future direction of our roadmap.


“Our attention was attracted to this particular app when we noticed that it was consuming an unusually large amount of bandwidth for the visual complexity of the scene. “Hello, hello, hello!” we said to ourselves, “What’s all this then?” Upon further investigation the crime scene revealed itself. The crime scene consisted of a field of long grass.


“We proceeded to assess the crime scene and the grass was found to be made up of imposters*, which is what we expected as this is a fairly standard technique for rendering scrub/foliage etc. In this particular case the imposters were made up of two quads which intersected each other at the mid-point of the X axis, at 90° to each other. Again, this is all fairly standard stuff.


“The forensics team used the Mali Graphics Debugger to look for evidence of foul play and immediately the full horrors of this case began to unfold. As we stepped through, the first issue became immediately obvious: the imposters were being drawn back to front. We let the frame complete and then checked the stats. The overdraw map showed a peak in double digits and the texture bandwidth was criminal! The grass was accounting for more than half of the total run-time of the scene.


“Continuing the investigation we found that the texture used for the grass/shrubs was also not MIP mapped or compressed. Given the viewing angle for the scene and the distance of each shrub imposter from the viewer, most of the imposters were very small, causing under-sampling of the texture (many texels mapped to each screen pixel), which was thrashing the cache and causing the excessive bandwidth consumption.


“After some more investigation we also found that rather than using “Punch through Alpha”**, the app had turned on Alpha blending, causing all overdrawn pixels to be blended with each other which was causing the engine to force the back to front ordering (alpha blended objects need to observe back to front ordering for visual correctness).

 

“Once the crime scene was cleaned up, your honor, the application performance improved considerably. Clearly this shows criminal neglect, your honor. That concludes the evidence for the prosecution."

 

*You basically replace a model with a textured 2D quad which rotates to always remain viewport aligned.

Imagine a cardboard cut-out of a tree that turns to always face you, and you have the idea!

**Transparent texels in an RGBA texture are marked with Alpha = 0 and are discarded in the fragment shader, acting as a mask.

All other texels have an Alpha > 0 and are written as opaque pixels; the alpha is not used for blending.

A cheaper way to do this is to use an RGB-only texture and pick either black (0, 0, 0) or white (1.0, 1.0, 1.0) as the mask value.
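For illustration, a minimal punch-through fragment shader might look like the sketch below; the texture and varying names are assumptions, and a threshold below 1.0 is used to allow for texture filtering:

precision mediump float;
uniform sampler2D uGrassTexture;
varying vec2 vTexCoord;

void main(void)
{
    vec4 texel = texture2D(uGrassTexture, vTexCoord);
    if (texel.a < 0.5)                     // masked-out texel: discard instead of blending
        discard;
    gl_FragColor = vec4(texel.rgb, 1.0);   // written as an opaque pixel, alpha is not used for blending
}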



Judge Hugh Harshly:

 

"I believe I've heard enough...


“I find the defendant guilty on the charge of fraudulent use of Alpha modes liable to cause excessive bandwidth consumption, being several times over the legal limit of overdraw while in charge of a GPU, cache abuse, extortion of bandwidth, applying a texture without due care and attention and finally failure to compress a texture... a most heinous crime.


“Do you have anything to say for yourself before I pass sentence?"


Defendant:

    

"Its a fit up! Society's to blame! What chance did I have growing up with a desktop GPU, I don't know no different do I?"


Judge Hugh Harshly:


“Very well… clerk, hand me my black cap would you, there’s a good fellow..."



“What does the benchmark say?”


I’m asked quite a lot about how I feel about benchmarks. When I sit down to write these blogs I usually go searching for suitable quotes. For this one I found a quote that perfectly sums up my feeling. 


This is from business leadership consultant Tom Peters:


"... I hate Benchmarking! Benchmarking is Stupid!”


Yep, I’m with Tom on this one, but we may need to qualify that a bit more… back to Tom:


“Why is it stupid?"

"Because we pick the current industry leader and then we launch a five year program, the goal of which is to be as good as whoever was best five years ago, five years from now."


While this statement was originally aimed at business leadership and strategy it is equally true of any type of performance benchmarking.


I’ve spent the last three years directly involved in, and most of my 20-year career indirectly involved in, the mire that is synthetic benchmarking of GPUs. Everything I've seen leads me to the conclusion that GPU benchmarks are a reinforcement of the above statement. They do nothing but focus the attention on narrow subsections of performance while purporting to tell you about the holistic performance of a GPU.


It seems logical to say that, in order to provide valuable input to an end consumer’s purchasing decision, it is better for GPU benchmarks to reflect real-world use-cases. Understanding how readily a GPU delivers the graphics of a user’s favorite game, and the length of time it can be played at a suitable FPS, would be useful information for consumers and OEMs alike. However, is this really the data that popular benchmarks deliver at the moment?


Desktop GPU benchmarking went through a similar evolution to the one that mobile GPUs are currently undergoing. In its earliest days it consisted of extremely theoretical and somewhat woolly comparisons of architectural triangles/second and pixels/second rates. This later developed into actual applications that purportedly measured tri/s and pix/s, before arbitrary spinning objects (spinning tori/donuts, Utah Teapots and Venus de Milos) entered the scene, which then led to the stage that the mobile GPU benchmarking scene is at currently: the stage where benchmarks consist of synthetic game scenes designed specifically to test a GPU’s maximum compute capacity. The next development, and where the PC market currently stands, is the comparison of metrics garnered by running actual content - real games - and assessing each GPU’s merits based on that. Well, there’s a novel concept! Actually using the content that people are running and care about? Shocker!

WhatDoesABenchmarkSay1.png

What’s wrong with current benchmarks?

Before we go any further, I feel an explanation as to why current benchmarks are not the best representation of GPU performance is needed. Current popular benchmarks claim to stress-test GPUs to discover the maximum number of frames they can deliver in a certain time period. In many ways this seems reasonable – all benchmarking really requires in order to be effective is a single figure derived from a test that is the same for all contenders, and maximum compute performance of the GPU fits into this category.


However, there are a number of issues with the way GPU benchmarks do this at the moment. Informing consumers that the device is capable of delivering 300+ frames of a particular content in a fixed time period may be a useful metric in certain circumstances, but it is not when there is no content that the consumer would normally use on his/her device which exercises the GPU in the way the GPU benchmarks currently do.


To the consumer, the figure delivered by benchmarks is completely arbitrary and does not correspond to any experience that they might have of the device. It would be easily possible to deliver exactly the same visual experience which the benchmarks use at much higher frame rates or, more appropriately for embedded devices, at a fraction of the energy cost and computing resources, if the benchmarks were coded in a more balanced way.


Surely, when the quality of graphics is the same between a benchmark and a popular game, it is better for a consumer to know how well the GPU delivers content that uses regular techniques and balanced workloads rather than an irregularly coded benchmark?


Later we'll look at my "Tao of GPU benchmarks" and discuss what guidelines a benchmark should follow, but first let's take a look under the hood of popular content and the benchmarks that are supposed to mirror it.

But benchmarks look exactly like popular games, so what’s the difference?

As an internal project, ARM has been running in excess of 1M frames of real content from top OpenGL® ES-enabled games on the App Store, including titles such as Angry Birds, Asphalt 7 and Temple Run. We analyse multiple performance areas including CPU load, frames per second, uArch data and a tonne of GPU-agnostic API usage and render flow composition data.


When you look at some examples of the data we gather in this sort of analysis, the results are quite striking. Looking at, say, the imagery in Asphalt 7 and T-Rex HD on the same ARM® Mali™-based 1080p device, you'd see that they appear to show similar levels of graphical user experience. This would lead a user to believe that they are constructed from a broadly similar level of workload. When we compare a selection of popular benchmarks with a selection of popular games, we see the following:


| 1080p                  | Benchmark A | Benchmark B | Benchmark C | Asphalt 7 | NFS Most Wanted | Spiderman |
|------------------------|-------------|-------------|-------------|-----------|-----------------|-----------|
| Avg. Vert./Frame       | 11K         | 760K        | 830K        | 200K      | 27K             | 40K       |
| Avg. Tris./Frame       | 12.5K       | 460K        | 780K        | 140K      | 18K             | 26K       |
| Avg. Frags./Frame      | 3.6M        | 6.2M        | 10M         | 8.0M      | 6.8M            | 8.1M      |
| Avg. Vert. FLOPS/Frame | 1.3M        | 53M         | 99M         | 11.5M     | 3.3M            | 5.1M      |
| Avg. Frag. FLOPS/Frame | 80M         | 148M        | 490M        | 165M      | 116M            | 258M      |


The first and most striking observation is that whilst the fragment count for benchmarks is similar to that of popular games, the vertex count goes through the roof!  And in fact, when we look more closely at Benchmark C, the use of vertices is in no way efficient.

“Do not use a hatchet to remove a fly from your friend's forehead” - Chinese proverb

The global average primitive-to-fragment ratio in this benchmark at 1080p is 1:13.1, which is close to (but just the right side of) our low watermark of 1:10, which we defined in the “Better Living Through (Appropriate) Geometry” blog, compared to a ratio of 1:53 in Asphalt 7. However, examining the content draw call by draw call, 50% of Benchmark C’s draw calls have a primitive-to-fragment ratio of less than 1:1, and a further 24% have a ratio of less than 1:10 - against a recommended guideline of more than 1:10! The same is true for Benchmark B, where 66% of the draw calls are producing micropolygons.



Real games are more balanced and consistent, with fewer micro-triangles and the majority of draw calls handling more than ten fragments per triangle.


 

Benchmark providers admit that they use high vertex counts in order to stress GPUs, with the justification that it provides users with “realistic” feedback on how their GPU will respond to future content. However, as demonstrated, such stress testing is not realistic as it doesn’t accurately reflect the balance of fragment and geometry work used in the applications consumers run on a daily basis. While the fragment rate and vertex rate of the real games show variation, the ratios stay pretty consistent.

 

Benchmarks vs Real Apps: Bandwidth

One of the major effects of the geometry imbalance shown above is that it does not take into account by far the most limiting factor in terms of mobile device performance: bandwidth. It’s extremely easy to break the bandwidth limit in an instant with these high cost/low visual yield micropolygons (as discussed in “PHENOMENAL COSMIC POWERS! Itty-bitty living space!”).


Let’s take a look at the benchmarks and see what the relative bandwidth looks like when compared to the real applications:


| Test Name       | Frame Buffer | Texture | Geometry |
|-----------------|--------------|---------|----------|
| Benchmark A     | 40%          | 40%     | 20%      |
| Benchmark B     | 10%          | 10%     | 80%      |
| Benchmark C     | 20%          | 20%     | 60%      |
| Asphalt 7       | 35%          | 35%     | 30%      |
| NFS Most Wanted | 30%          | 35%     | 35%      |

 

As you can see, again, the real-world applications are more consistent in the balance of bandwidth used across the rendering. “Benchmark A” starts off pretty well, but unfortunately things go off the rails pretty quickly with the others. What we see here is 3-8x more bandwidth being used for the geometry (which, as discussed in “Better living through (appropriate) geometry”, is supposed to be a container for the samples), meaning there is less bandwidth available for fragment generation - which is what the user will actually see.

 

The future of mobile benchmarking

So, what’s the conclusion? Well, GPU benchmarks generally still have a long way to go, mobile ones more so. I am looking forward to the time when, as for desktop and console games, mobile game developers release their own benchmarks using sections from real application workloads, allowing for a far more well-rounded view of the GPU.


Until then, I have a couple of suggestions that will not only make GPU benchmarking a lot more informative for consumers, but will also leave semiconductor companies with more time to worry about how to improve GPU performance for consumer content rather than how to impress customers in the next important benchmark rankings.


I have produced the following “Tao of GPU benchmarks” as a guide which I hope people will follow:

 

  1. Apply Moore’s Law.
    • Moore’s Law (compute potential doubles every 18 months) applies to GPUs as much as it does to CPUs.
    • Year on year the average workload represented in a benchmark should not exceed double the previous year’s, and it should remain balanced. This way you don’t attempt to outstrip Moore’s Law.
  2. Make it a GPU test, not a bandwidth test.
    • The raw bandwidth per frame at 60fps should not exceed the available bandwidth.
    • The baseline for bandwidth should be set at a typical mobile device for the next 24 months.
    • Make the objective of the test as independent as possible from whether the device has high bandwidth capacity or not.
  3. Tests should use recognized techniques.
    • Techniques should be aligned with current best practice.
    • These techniques should also be relevant to the mobile market.
  4. Excessive geometry is not an acceptable proxy for workload.
    • Primitive to fragment ratio per draw call should be balanced.
    • Lots of benchmarks at present have far too much geometry.
    • The 10 frags/prim rule should be the lowest water mark for this.
  5. Overdraw is not an acceptable proxy for workload.
    • Keep it real! An overdraw average in excess of 2x on any surface is not representative.

Here comes OpenGL® ES 3.1!



As I write this, I’m packing my bag for a week in San Francisco, where I’ll be attending the Game Developers Conference (GDC). GDC is always a blast, especially now that there’s a big ARM® Mali™ presence, but never more so (for me at least) than this year. The reason is simple; in my role as Khronos OpenGL ES Working Group chair, I’ll have the honor of announcing OpenGL ES 3.1, the latest version of the world’s most widely deployed 3D graphics API. It’s the culmination of a year of very hard work – mostly by other people, but I’m proud to say I was in the room when the work got done. And, of course, I’m deeply grateful to those who did it. In this blog, I’ll give a personal perspective on what OpenGL ES 3.1 is and why it's important.


What’s in it?

 

OpenGL ES 3.1 isn’t as obviously a big deal as its predecessor, OpenGL ES 3.0, which added over two dozen major features, and extended both the API and the shading language in almost every possible direction. After all, ES 3.0 took five years to create, and was intended to drive hardware requirements for a new generation of mobile GPUs. ES 3.1, on the other hand, was done in about a year, and is explicitly designed to run on most if not all existing ES 3.0-capable hardware. It’s no wonder that by comparison, it looks like a relatively modest advance. But is it? Here’s my view:

 

Many of the features in the new API amount to filling in gaps in ES 3.0 (bitfield operations in the shading language! Multidimensional arrays!), and continuing our efforts (which began in ES 3.0) to tighten the specification, improve application portability across implementations, and reduce application and driver overhead.  Don’t get me wrong, these features are very important – they make life much better for programmers, leading ultimately to more, better, and cooler applications for everyone. And I can tell you, specifying and testing them is hard (and essential) work.  But they’re kind of hard to appreciate unless you’re a standards geek, or a graphics programmer.


However, I claim that OpenGL ES 3.1’s headline features are going to change the way we do mobile graphics, in ways that will be obvious to everyone. For my money, there are two that stand out; first, it adds compute shaders, which allow the GPU to be used for general-purpose computing, tightly coupled with GPU-based graphics rendering.  Second, it adds indirect drawing commands, which allow the GPU to read drawing command parameters from memory instead of receiving them directly from the CPU. I’ll explain why that’s important in a moment.

 

GPU Compute in OpenGL ES 3.1

 

Compute support in OpenGL ES 3.1 consists of a handful of new features that sound minor when considered individually, but have huge implications when combined.  (This happens all the time in the tech industry. Hypertext is a way of linking related documents and data (remember HyperCard™?), and the internet is a (large) group of networked computers that agree to exchange data using a standard set of protocols.  Put them together, and you get the World-Wide Web, which is a different animal altogether.) 

 

The first critical compute feature OpenGL ES 3.1 adds is direct access to memory: shader programs can read and write arbitrary data stored in memory buffers or texture images.  The second critical feature is a set of synchronization primitives that allow applications to control the ordering of memory accesses by different threads running in parallel on the GPU, so that results don’t depend on what order the threads run in. The third is the ability to create and dispatch compute shaders, programs for the GPU whose invocations correspond to iterations of a nested loop rather than to graphics constructs like pixels or vertices.

 

With these features, you can do things like this: Create a 2D array in GPU memory representing points on a piece of cloth, and global data representing objects or forces acting on the cloth. Dispatch a compute shader that creates a thread for every point in the array.  Each thread reads the position and velocity of its point on the cloth, and updates them based on the forces acting on the cloth.


cropped-2014-03-13-110649.png

 

Figure 1: A rather nice carpet, animated by an ES 3.1-style compute shader, has a frightening encounter with a big shiny flying donut. Photo (and demo) courtesy of Sylwester Bala, Mali Demo Team. You can watch the video here.
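As a rough illustration of the idea (a minimal sketch, not the demo's actual code), a cloth-style update could be expressed with a compute shader along these lines; the buffer layout, uniform names and the extremely simplified force model are all assumptions:

#version 310 es
layout(local_size_x = 64) in;

struct Particle
{
    vec4 position;   // xyz = position of this cloth point
    vec4 velocity;   // xyz = velocity of this cloth point
};

layout(std430, binding = 0) buffer Cloth
{
    Particle particles[];
};

uniform vec3 uForce;      // e.g. gravity plus wind
uniform float uDeltaT;

void main()
{
    uint i = gl_GlobalInvocationID.x;    // one invocation per cloth point
    Particle p = particles[i];
    p.velocity.xyz += uForce * uDeltaT;
    p.position.xyz += p.velocity.xyz * uDeltaT;
    particles[i] = p;
}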


Indirect Drawing

 

Indirect drawing sounds even more innocent than the various features that support GPU computing; it just means that the GPU can accept a drawing command whose parameters (such as how many items to draw, and where to find their vertices) are stored in memory, rather than passed as function-call arguments by the CPU.  What makes this interesting is that the memory buffer containing the parameters is fully accessible to the GPU – which means that a compute shader can write them. So for example, an application can fire off a compute shader that generates geometry data into a vertex buffer object, and also fills in an indirect drawing command that describes that data. After the compute shader finishes, the GPU can proceed to render the geometry as described in the buffer, without any additional work by the application or the CPU.
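A hedged sketch of how those pieces fit together on the API side (the buffer and variable names are illustrative, not part of any particular engine):

// A compute shader has generated vertices into a VBO and written a draw command
// (count, instanceCount, first, reserved) into 'indirectBuf'.
glDispatchCompute(numGroupsX, 1, 1);

// Make the compute shader's writes visible to vertex fetch and to the indirect command fetch.
glMemoryBarrier(GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT | GL_COMMAND_BARRIER_BIT);

glBindVertexArray(generatedGeometryVAO);
glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuf);
glDrawArraysIndirect(GL_TRIANGLES, (const void *)0);   // parameters come from the buffer, not from the CPU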

 

Winding Up

 

There’s other interesting stuff in OpenGL ES 3.1, but I’m out of space to talk about it. By the time you read this, the official specification will be available in the Khronos OpenGL ES registry, and there’ll be lots of information floating around following GDC presentations by myself and my fellow Working Group members. Incidentally, if you’re attending GDC, I hope you’ll stop by the ARM booth or one of our technical talks, and/or come to the Khronos OpenGL ES session, where we’ll walk through the OpenGL ES 3.1 specification in detail.

 

When will you see ES 3.1 in consumer devices? It’s up to the device makers, of course; but the Khronos conformance test should be up and running by this summer, and the API is meant to run on existing OpenGL ES 3.0 hardware, so it shouldn’t be terribly long. It will certainly be supported* on the ARM Mali Midgard GPUs.


As always – got comments or questions? Drop me a line…

 

Tom Olson is Director of Graphics Research at ARM. After a couple of years as a musician (which he doesn't talk about), and a couple more designing digital logic for satellites, he earned a PhD and became a computer vision researcher. Around 2001 he saw the coming tidal wave of demand for graphics on mobile devices, and switched his research area to graphics.  He spends his working days thinking about what ARM GPUs will be used for in 2016 and beyond. In his spare time, he chairs the Khronos OpenGL ES Working Group.


*Based on the published Khronos specification. Product is expected to pass the Khronos Conformance Testing Process when available. Current conformance status can be found at http://www.khronos.org/conformance.


Bringing Artists Together @ GDC 2014


GDC is always a blast – it brings technologists, gaming enthusiasts, game developers and artists under one roof. At ARM, we are striving to ensure we have something for everyone attending GDC. My colleagues Gemma Paris and Phill Smith recently published their blogs - ARM and the ARM® Mali™ Ecosystem are ready to rock at GDC! and Expanding the Mobile Experience at GDC 2014 - highlighting ARM’s activity line-up at GDC.

 

To add to our massive line-up of activities at GDC, ARM, in collaboration with Samsung Developers and Autodesk, is running a unique competition for artists – “The Artists Competition @ GDC”.

 

Thanks to innovation by ARM and the partner ecosystem, mobile devices today are extremely powerful and support the tools and interfaces artists need to create designs, while maintaining the power efficiency that ARM has always been known for. Autodesk’s SketchbookPro application, for example, helps artists create designs on-the-go on their Samsung Galaxy Note devices.

 

Renee Busse (@reneedicherri) from the Autodesk SketchbookPro team created some of these awesome designs on her Note.

 

Autodesk sketchbookpro1.png

 

Autodesk sketchbookpro2.png

 

Artists, come play and participate in the ongoing artists’ competition at the ARM Booth #1616. Each day from March 19-21, the best design created on the Samsung Galaxy NotePRO 12.2 (based on the ARM Mali-T628 GPU) with the Autodesk SketchbookPro application will win a Galaxy Note 8. The Galaxy Note 8 prizes are sponsored by Samsung Developers.

 

Look forward to some more cool designs at GDC…

So, what's new at GDC 2014?


If you're looking for a short summary of the ARM-related news from GDC, look no further than here! The hugely exciting Expo Floor has opened, the attendees are flooding in, the major announcements have all been made and we have a round up of it here just for you:

 

OpenGL® ES 3.1

 

OpenGL ES is the API that is used in nearly all the world's mobile devices, and on Monday it was announced that the latest version, through its inclusion of features such as compute shaders and indirect drawing, will enable developers to use cutting-edge features traditionally reserved for the PC space on mobile devices. Tom Olson, our Director of Graphics Research and, more importantly in this matter, Chair of the OpenGL ES Working Group, has already written a great blog celebrating this subject, Here comes OpenGL® ES 3.1!, and presented on the subject to a large audience at GDC yesterday in his talk "Getting the most out of OpenGL ES". For those wanting to see a visual example of what OpenGL ES 3.1 can do, ARM released this demo earlier in the week and has been showing it on the booth:

 

 

For more information on the OpenGL ES 3.1 API, check out the Khronos Website.

 

Geomerics Global Illumination technology is going....well, even more global

 

Whilst the ARM team was settling into its second day at the Smartphone Summit, our yearly warm-up for the main booth, Geomerics was at the heart of one of the biggest announcements of the week for worldwide game developers. Unity 5, the award-winning, multiplatform engine, was announced on Tuesday and, amid the number of new features that will make the creation of stunning games that much easier, its key update is the integration of Enlighten technology, the real-time lighting system developed by Geomerics. This is fantastic news for the gaming industry as it will enable the vast numbers of developers who use Unity to do things like animate lights, generate emissive material properties and control the environment lighting in real time. Enlighten is the only real-time global illumination technology optimized to deliver fully dynamic lighting in game on today's PCs, consoles, and mobile platforms.

 

Geomerics' Transporter demo, being shown on the ARM Booth at GDC, showcases the latest features of Enlighten technology and the beautiful lighting effects that can be achieved on mobile.

 

 

Cocos2D-x introduces 3D

 

The ARM Booth is hosting a great selection of partners this year and one of them, Cocos2D-x, announced their new Cocos3D-x engine last week. Cocos3d-x is a multi-platform game framework for building 3D games and applications using Objective-C, based on the popular cocos2d-x engine which has played a part in over 1000 games. The 3D application development framework targets the iOS, Android and MacOSX platforms.

 

 

Epic Games releases subscription model

 

We managed to catch up with Epic Games on their booth and they discussed the new subscription for Unreal Engine 4 - all of its leading-edge tools, features and complete C++ source code are becoming more readily available to the entire development community, no matter what their size, budget or aspiration.

 

 

Meanwhile, on the ARM Booth

 

Yesterday saw the Expo Floor open up to all attendees and the ARM Booth buzzed with activity for the day. The ARM Lecture Theater, which continues over Thursday and Friday, offered valuable development tips and ideas to attendees with many of our partners joining us to give a talk on their mobile gaming solutions. The Gaming Wall saw avid gamers compete to get the fastest lap in the Sports Car Challenge whilst beautiful designs appeared on the Artist Competition wall. All in all it was a great day on the ARM Booth and I will have more news for you tomorrow!

 

photo 4.JPG
Goo Technologies' Victor Sand wraps up a presentation in the ARM Lecture Theater

DSC_2937[1].jpg

Three attendees discover who is the fastest in the Sports Car Challenge car

DSC_2917[1].jpg

Sylwester Bala is on hand to explain the features of Geomerics' Transporter demo

DSC_2915[1].jpg

The demos on the ARM wall exemplify how ARM is expanding the mobile gaming experience

GPU Compute: Dealing with the Elephant in the Room


In this first blog of a series about GPU Compute we look at the one of the biggest challenges facing the future development and evolution of smart devices.


Isn’t technology wonderful?  It’s incredible to think that the processing power in mobile devices has increased 12 times in the last 4 years.  Screen resolutions have increased by over 13 times in the same period. And as our smart devices are capable of more and more, we’re doing more and more with them.  Study after study shows a continued shift away from desktops and laptops as internet, gaming and entertainment go increasingly mobile. But with all this innovation there’s a problem.  An engineering elephant in the room.  In the same 4 years, whilst everything else has increased by an order of magnitude, battery technology has only increased by a factor of 2.  In engineering terms this presents a massive challenge.  All that processing capacity at our fingertips, cruelly grabbed away at the last minute.

 

tech improvements.png
Processing power information source: McKinsey&Company, “Making smartphones brilliant: ten trends” http://goo.gl/rkSP4

 

So if we could invent better batteries, we’d be OK, right?  Well, although better batteries would be very welcome, sadly it’s not that simple.  The bigger problem than battery power alone is the one of thermal dissipation.  Not the most glamorous subject maybe – I don’t think anyone wrote a Thermal Dissipation folk song for example – but it’s a critical issue facing mobile engineers today.   Put simply, even if we had the power to run our processors harder they would melt because there’s no way to get rid of all the heat they would generate.  This elephant is not only getting in the way, he’s about to do something unpleasant in the corner.

 

So to tackle this issue we have to think long and hard about processing efficiency. One way to do this is to add more CPU cores.  Indeed a mixture of faster and more energy efficient cores (big.LITTLE Processing - ARM) allows devices to ramp up and down depending on the demand.  But just adding CPU cores doesn’t scale efficiently – after a while we see diminishing returns.

 

The key to all this – and a very effective way to tackle processing efficiency – is to think heterogeneously.  The idea of heterogeneous computing is to spread the computing load not only across multiple processors, but across different types of processor.  That involves distributing individual parts of your programme to processors that are best suited to run them.  So, for example, general purpose programme flow would sit with the CPU, whilst a complex image processing algorithm might run on a specialist processor designed to cope efficiently with highly parallel workloads.

 

One such processor is of course the GPU.  Designed to process millions of vertices and pixels to create user interfaces, games and applications for modern smart devices, the GPU is a master at doing things efficiently in parallel.  Early generations of mobile GPUs were limited to graphics only, but back in November 2012 Google’s Nexus 10 – based on the ARM® Mali™-T604 GPU – became the first mobile device capable of running GPU-accelerated general purpose compute.

 

nexus.png

Google’s Nexus 10 with Mali-T604 GPU

 

Since then the true benefit of designing applications to run heterogeneously has been demonstrated time after time.  Not only can mobile GPUs speed up certain activities – like image processing, computer vision, video decoding etc. – they can usually do it significantly more efficiently.  And using less power to achieve the same thing is all part of tackling that elephant.

 

But creating applications that make good use of compute on GPUs can be daunting for software engineers used to traditional programming techniques.  It not only requires a new way of thinking, but new tools and APIs as well.  And understanding the capabilities of the processors at your disposal is a key step to getting the best out of a platform.  In this series of blogs we’ll be going into plenty of detail on this brave new elephant-banishing world.  We’ll be covering the Mali-T600 and T700 GPU architectures in detail, explaining how they do what they do so you can get the best out of them.  We’ll be looking at optimization techniques, software tools and languages that can help you along the way.  This will include blogs on Google’s RenderScript, OpenCL™, ARM NEON™ technology, and much more.

 

So stay tuned for more on the world of compute on Mali GPUs and let us know in the comments any particular areas of interest you would like to us to cover.

 

If you have a Samsung Chromebook you can try OpenCL on Mali for yourself.  Check out this guide on the Malideveloper website: http://malideveloper.arm.com/develop-for-mali/features/graphics-and-compute-development-on-samsung-chromebook/


And if you are interested in RenderScript on the Nexus10, here’s a good place to start: http://developer.android.com/guide/topics/renderscript/compute.html

Introducing PyOpenCL


I will start my blog series with a subseries on the technology that I know and have come to love best - OpenCL from the Khronos Group. OpenCL aims to provide functional portability across computer systems comprised of general-purpose processors (e.g. CPUs) and special-purpose accelerators (e.g. GPUs), which I will now and then refer to as accelerated systems.

 

OpenCL is often criticised for presenting a steep learning curve for software engineers but I don't think that's fair: the verbosity of the OpenCL API is really a consequence of its generality. Once you have written a couple of OpenCL programs, you realise that most of the code running on the host processor is actually boilerplate.

 

I will sidestep this verbosity of OpenCL altogether by using PyOpenCL - a neat Python module written by Andreas Klöckner. (If you are reading this Andreas, keep up the good work!)

 

Just install PyOpenCL and NumPy - and you are ready to roll!

In [1]:

import pyopencl as cl

import numpy as np

 

 

Trivial example

Suppose we want to create an array containing integers from 0 to 15.

In [2]:
N = 16

 

That's trivial using NumPy:

In [3]:

np_range=np.arange(N,dtype=np.int32)

np_range

Out[3]:

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15], dtype=int32)

 

but our PyOpenCL example will fill in a similar array using OpenCL:

In [4]:

cl_range=np.zeros(N,dtype=np.int32)

cl_range

Out[4]:

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

 

 

Gimme some context!

Creating a context could hardly be easier:

In [5]:
context=cl.create_some_context()
In [6]:
queue=cl.CommandQueue(context)

 

 

Building and running

An OpenCL C program equivalent to np.arange(N) follows:

In [7]:

source='''

kernel void arange(global int * buffer)

{ 

    const size_t gid = get_global_id(0); 

    buffer[gid] = convert_int(gid);

}

'''

 

The kernel will be launched as N work-items over a one-dimensional range [0, N-1]. Each work-item will get its unique index gid in the range (that is, an integer between 0 and N-1 inclusive) and write it into argument buffer at offset gid.

 

Let's build the program:

In [8]:
program = cl.Program(context, source).build()

 

allocate a memory buffer:

In [9]:
memory_flags = cl.mem_flags.WRITE_ONLY | cl.mem_flags.ALLOC_HOST_PTR
memory = cl.Buffer(context, flags=memory_flags, size=cl_range.nbytes)

 

launch the kernel:

In [10]:
kernel = program.arange(queue, [N], None, memory)

 

and copy the results from the buffer to cl_range:

In [11]:

cl.enqueue_copy(queue,cl_range,memory,wait_for=[kernel])

cl_range

Out[11]:

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15], dtype=int32)

 

Finally, let's confirm that arrays np_range and cl_range match element-wise:

In [12]:
np.all(np_range == cl_range)
Out[12]:

True

 

 

Concluding remarks

I am very interested in getting feedback. I am aiming primarily at software engineers already familiar with OpenCL so that I can jump straight to more advanced topics. But if you feel you'd like a little more explanation, just ask me: I will be happy to explain, point you to some excellent learning resources out there, or indeed consider filling in details in my later posts...

 

Also, I prepared this post using the wonderful IPython Notebook environment. I haven't seen much in the way of using PyOpenCL and IPython together, so would be very grateful for any links.

 

In my next post, I will start discussing how to optimise OpenCL kernels for the ARM® Mali-T600 GPU series.

Pixel Local Storage on ARM® Mali™ GPUs


ARM has recently published a set of OpenGL® ES extensions. Here we explain some of the background that led us to develop these and show how they can be used to implement some common graphics algorithms more efficiently.

 

Locality of reference

 

Many algorithms in computer science can be implemented more efficiently by exploiting locality of reference. That is, efficiency can be gained by making the memory access patterns of an algorithm more predictable. This is also true for computer graphics and is an underlying principle behind the tile-based architectures of the ARM Mali GPUs.

 

But the locality principle applies beyond tiles. Many graphics algorithms have locality at the level of individual pixels: a value written to a pixel in one operation may be read or modified by a later operation working at the exact same pixel location. Blending is a basic example of this principle that is supported on current GPUs. The value of a pixel is written while rendering a primitive and later read and modified while rendering another primitive. But there are also more complex examples, such as deferred shading, where this principle is not yet exploited. These more complex algorithms require storing multiple values per pixel location, which are finally combined in an application-specific way to produce the final pixel value. On today’s graphics APIs, these algorithms are typically implemented by a multi-pass approach. Pixel values are first written to a set of off-screen render targets, for example using the Multiple Render Target (MRT) support in OpenGL® ES 3.0. In a second pass, these render targets are read as textures and used to compute the final pixel value that is written to the output framebuffer.

 

One obvious issue with the multi-pass approach is that the intermediate values must be written back to memory. This is far from ideal since keeping memory bandwidth – and thereby power - down is very important for mobile GPUs.

 

A more efficient approach is possible on the ARM Mali GPUs. As mentioned above, ARM Mali GPUs have a tile-based architecture. As described in a previous blog post by Peter Harris (The Mali GPU: An Abstract Machine, Part 2 - Tile-based Rendering), this means that we perform fragment shading on 16x16 tiles at a time. All memory required to store the framebuffer values for a tile is stored on-chip until all fragment shading for the tile is complete. This property led us to develop a set of extensions that enables applications to better exploit the locality principle, or what we generally refer to as pixel local storage. The first two extensions, ARM_shader_framebuffer_fetch and ARM_shader_framebuffer_fetch_depth_stencil add the ability to return the current color, depth, and stencil values of a pixel to the fragment shader. The third extension, EXT_shader_pixel_local_storage, enables applications to store custom data per pixel.

Shader Framebuffer Fetch

 

ARM_shader_framebuffer_fetch enables applications to read the current framebuffer color from the fragment shader. This is useful for techniques such as programmable blending. An example of how this would be used is shown below.


#extension GL_ARM_shader_framebuffer_fetch : enable
precision mediump float;
uniform vec4 uBlend0;
uniform vec4 uBlend1;

void main(void)
{
     vec4 color = gl_LastFragColorARM;
     color = mix(color, uBlend0, color.w * uBlend0.w);
     color *= uBlend1;
     gl_FragColor = color;
}

 

ARM_shader_framebuffer_fetch_depth_stencil additionally allows applications to read the current depth and stencil values from the framebuffer. This enables use-cases such as programmable depth and stencil testing, modulating shadows, soft particles and creating variance shadow maps in a single render pass. Example code for the last two uses-cases is included in the Bandwidth Efficient Graphics with ARM Mali GPUs white paper.
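To give a flavour of the depth fetch, a much-simplified soft-particle fragment shader might look like the sketch below; the fade scale and uniform names are assumptions, and any depth linearization is omitted for brevity:

#extension GL_ARM_shader_framebuffer_fetch_depth_stencil : enable
precision mediump float;
uniform vec4 uParticleColor;
uniform float uFadeScale;     // controls how quickly the particle fades near geometry

void main(void)
{
    // gl_LastFragDepthARM is the depth value already in the framebuffer at this pixel.
    float sceneDepth = gl_LastFragDepthARM;
    float fade = clamp((sceneDepth - gl_FragCoord.z) * uFadeScale, 0.0, 1.0);
    gl_FragColor = vec4(uParticleColor.rgb, uParticleColor.a * fade);
}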

Shader Pixel Local Storage

 

EXT_shader_pixel_local_storage enables applications to store and retrieve arbitrary values at a given pixel location. This is a powerful principle that enables algorithms such as deferred shading to be implemented without incurring a large bandwidth cost. The amount of storage per pixel is implementation defined, but the extension guarantees that there is storage for at least 16 bytes per pixel.

 

You will notice that this is an “EXT” extension rather than a vendor-specific “ARM” extension. In OpenGL ES parlance, an “EXT” means multi-vendor. In this case, we worked with other industry players to define the extension, in order to ensure that it works well on their hardware as well as on ours.

 

So how does it work? Let’s look at a deferred shading example. A typical implementation of this technique using EXT_shader_pixel_local_storage splits the rendering into three passes: a G-Buffer generation pass where the properties (diffuse color, normal, etc.) of each pixel are stored in pixel local storage, a Shading pass where lighting is calculated based on the stored properties and accumulated in pixel local storage, and a Combination pass that uses the values in pixel local storage to calculate the final value of the pixel. These passes are outlined below. For a complete example and further descriptions of the algorithm, refer to the code sample at the Mali Developer Center.

 

In the G-Buffer generation pass, instead of writing to regular color outputs, the fragment shader would declare a pixel local storage output block:

 

__pixel_local_outEXT FragData
{
    layout(rgba8) highp vec4 Color;
    layout(rg16f) highp vec2 NormalXY;
    layout(rg16f) highp vec2 NormalZ_LightingB;
    layout(rg16f) highp vec2 LightingRG;
} gbuf;

void main()
{
    gbuf.Color = calcDiffuseColor();
    vec3 normal = calcNormal();
    gbuf.NormalXY = normal.xy;
    gbuf.NormalZ_LightingB.x = normal.z;
}

 

The shader would use this block to store the G-Buffer values in the pixel local storage. The image below illustrates what the contents of the pixel local storage might look like at the end of this pass. Keep in mind that that only one tile’s worth of data would be stored at any given time.


pass1.png

 

In the Shading pass, the same pixel local storage block would be used to accumulate lighting. In this case, the pixel local storage block would be both read from and written to:

 

__pixel_localEXT FragData
{
    layout(rgba8) highp vec4 Color;
    layout(rg16f) highp vec2 NormalXY;
    layout(rg16f) highp vec2 NormalZ_LightingB;
    layout(rg16f) highp vec2 LightingRG;
} gbuf;

void main()
{
    vec3 lighting = calclighting(gbuf.NormalXY.x,
                                 gbuf.NormalXY.y,
                                 gbuf.NormalZ_LightingB.x);
    gbuf.LightingRG += lighting.xy;
    gbuf.NormalZ_LightingB.y += lighting.z;
}

 

At this point, the contents of the pixel local storage would also include the accumulated lighting (see image below):

 

pass2.png

 

Finally, the Combination pass would read from the pixel local storage and calculate the final pixel value:

 

__pixel_local_inEXT FragData
{
    layout(rgba8) highp vec4 Color;
    layout(rg16f) highp vec2 NormalXY;
    layout(rg16f) highp vec2 NormalZ_LightingB;
    layout(rg16f) highp vec2 LightingRG;
} gbuf;

out highp vec4 fragColor;

void main()
{
    fragColor = resolve(gbuf.Color,
                        gbuf.LightingRG.x,
                        gbuf.LightingRG.y,
                        gbuf.NormalZ_LightingB.y);
}


We now have our final image (see below) and the pixel local storage is no longer valid.


pass3.png

 

The important point here is that the pixel local storage data is never written back to memory! The memory for the pixel local storage is kept on-chip throughout and incurs no bandwidth cost. This is significantly more efficient than existing solutions that would require writing 16 bytes of data per pixel for the G-Buffer pass and subsequently read the same amount of data back again in the Shading and Combination passes.
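As a rough back-of-the-envelope illustration, assuming a 1080p render target: 1920 x 1080 pixels x 16 bytes is around 33MB for the G-Buffer alone; written once and read back twice, that is roughly 100MB of traffic per frame, or in the region of 6GB/second at 60 FPS, all of which stays on-chip with the pixel local storage approach.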

 

It is also worth pointing out that the above example does not store the depth value in pixel local storage. This is not necessary since ARM_shader_framebuffer_fetch_depth_stencil works well in combination with pixel local storage, effectively increasing the amount of application specific data that can be stored per pixel.

 

Conclusion

 

We are very excited about the possibilities opened up by these extensions. These pave the way for algorithms such as deferred shading to be implemented efficiently on mobile GPUs.

 

And it’s not just about efficiency: these extensions allow you to express the algorithm more directly compared to using an approach based around MRTs. Support for framebuffer fetch from MRTs could avoid some of the bandwidth issues for deferred shading, but would require a more complex implementation. In addition to creating and managing the textures and render buffers for the off-screen render passes, the application would have to provide the appropriate hints, like glInvalidateFramebuffer, to prevent the off-screen render targets from being written to memory. It would also have to rely on clever driver heuristics to avoid the memory being allocated in the first place. Using the extensions presented here, these complexities go away. Everything happens in the fragment shader, allowing you to focus on the core of your algorithm rather than complex state management.

 

ARM will support these extensions on all GPUs based on the Midgard Architecture. Support for ARM_shader_framebuffer_fetch and ARM_shader_framebuffer_fetch_depth_stencil is also expected to be added to the Mali-400 series of GPUs later this year.


What ideas do you have for using these extensions? We'd be interested in hearing, let us know in the comments below.


Mali Graphics Performance #1: Checking the Pipeline


Over the first few blogs in this series I have introduced the high level rendering model which the Mali "Midgard" GPU family uses. In the remainder of this series I will explain how to use DS-5 Streamline, a system-level profiling tool from ARM, to identify areas where an application is not getting the best performance out of a Mali-based system.


In this blog we will look at debugging issues around macro-scale pipelining, the means by which we keep the GPU busy all of the time, and some of the common reasons for that frame level pipeline to stall. If you are new to this series I would recommend reading at least the first blog, as it introduces the concepts which we will be investigating in more detail this time around.

 

Note: I'm assuming you already have DS-5 Streamline up and running on your platform. If you haven't done this yet, there are guides posted on the community for getting set up on a variety of Mali-based consumer devices.

 

 

The examples in this blog were captured using DS-5 v5.16.

 

What does good content look like?

 

Before we dive into diagnosing performance problems it is useful to understand what we are aiming for, and what this looks like in Streamline. There are two possible "good" behaviors depending on the performance of the system and the complexity of the content.

 

  • One for content where the GPU is the bottleneck
  • One for content where the vsync is the bottleneck

 

The counters needed for this experiment are:

 

  • Mali Job Manager Cycles: GPU cycles
    • This counter increments any clock cycle the GPU is doing something
  • Mali Job Manager Cycles: JS0 cycles
    • This counter increments any clock cycle the GPU is fragment shading
  • Mali Job Manager Cycles: JS1 cycles
    • This counter increments any clock cycle the GPU is vertex shading or tiling

 

The GPU is the bottleneck

 

If we successfully create and maintain the frame-level rendering pipeline needed for content where the GPU is the bottleneck (e.g. the rendering is too complex to hit 60 FPS), then we would expect one of the GPU workload types (vertex or fragment processing) to be running at full capacity all of the time.

 

In nearly all content the fragment processing is the dominant part of the GPU execution; applications usually have one or two orders of magnitude more fragments to shade than vertices. In this scenario we would therefore expect JS0 to be active all of the time, and both the CPU and JS1 to be going idle for at least some of the time every frame.

 

When using Streamline to capture this set of counters we will see three activity graphs which are automatically produced by the tool, in addition to the raw counter values for the GPU. We can see that the "GPU Fragment" processing is fully loaded, and that both the "CPU Activity" and the "GPU Vertex-Tiling-Compute" workloads are going idle for a portion of each frame. Note: you need to zoom down close to the 1ms or 5ms zoom level to see this - we are talking about quite short time periods here.

P4-good-application.png

The vsync signal is the bottleneck

 

In systems which are throttled by vsync then we would expect the CPU and the GPU to go idle every frame, as they cannot render the next frame until the vsync signal occurs and a window buffer swap happens. The graph below shows what this would look like in Streamline:

 

P4-good-application-vsync.png

 

If you are a platform integrator rather than an application developer, testing cases which are running at 60FPS can be a good way to review the effectiveness of your system's DVFS frequency choices. In the example above there is a large amount of time between each burst of activity. This implies that the DVFS frequency selected is too high and that the GPU is running much faster than it needs to, which reduces energy efficiency of the platform as a whole.

 

Content issue #1: Limited by vsync but not hitting 60 FPS

 

In a double-buffered system it is possible to have content which is not hitting 60 FPS, but which is still limited by vsync. This content will look much like the graph above, except the time between workloads will be a multiple of one frame period, and the visible framerate will be an exact division of the maximum screen refresh rate (e.g. a 60 FPS panel could run at 30 FPS, 20 FPS, 15 FPS, etc).

 

In a double-buffered system which is running at 60 FPS the GPU successfully manages to produce frames in time for each vsync buffer swap. In the figure below we see the lifetime of the two framebuffers (FB0 and FB1), with periods where they are on-screen in green, and periods where they are being rendered by the GPU in blue.

 

vsync.png

 

In a system where the GPU is not running fast enough to do this, we will miss one or more vsync deadlines, so the current front-buffer will remain on screen for another vsync period. At the point of the orange line in the diagram below the front-buffer is still being displayed on the screen, and the back-buffer is queued for display, the GPU has no more buffers to render on to and goes idle. Our performance snaps down to run at 30 FPS, despite having a GPU which is fast enough to run the content at over 45 FPS.

 

vsync-slow.png

 

The Android windowing system typically uses triple buffering, so it avoids this problem as the GPU has a spare buffer available to render on to, but this is still seen in some X11-based Mali deployments which are double buffered. If you see this issue it is recommended that you disable vsync while performing optimization; it is much easier to determine what needs optimizing without additional factors clouding the issue!

 

Content issue #2: API Calls Break the Pipeline

 

The second issue which you may see is a pipeline break. In this scenario at least one of the CPU or GPU processing parts are busy at any point, but not at the same time; some form of serialization point has been introduced.

 

In the example below the content is fragment dominated, so we would expect the fragment processing to be active all the time, but we see an oscillating activity which is serializing GPU vertex processing and fragment processing.

 

P4-ping-pong.png

The most common reason for this is the use of an OpenGL ES API function which enforces the synchronous behavior of the API, forcing the driver to flush all of the pending operations and drain the rendering pipeline in order to honor the API requirements. The most common culprits here are:

 

  • glFinish(): explicitly request a pipeline drain.
  • glReadPixels(): implicitly request a pipeline drain for the current surface.
  • glMapBufferRange() without GL_MAP_UNSYNCHRONIZED_BIT set: explicit pipeline drain for all pending surfaces using the data resource being mapped.

 

It is almost impossible to make these API calls fast due to their pipeline-draining semantics, so I would suggest avoiding these specific uses wherever possible. It is worth noting that OpenGL ES 3.0 allows glReadPixels() to target a Pixel Buffer Object (PBO), which can perform the pixel copy asynchronously. This no longer causes a pipeline flush, but may mean you have to wait a while for your data to arrive, and the memory transfer can still be relatively expensive.
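For the glMapBufferRange() case, a minimal sketch of an update that opts out of the implicit synchronization is shown below; the buffer name and data are illustrative, and the application then becomes responsible for ensuring the GPU is not still reading the mapped range (for example by cycling between sub-ranges or buffers):

GLbitfield flags = GL_MAP_WRITE_BIT |
                   GL_MAP_UNSYNCHRONIZED_BIT |    // do not wait for the GPU to finish with the buffer
                   GL_MAP_INVALIDATE_RANGE_BIT;   // we are overwriting the whole range anyway

glBindBuffer(GL_ARRAY_BUFFER, dynamicVbo);
void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, offset, size, flags);
if (ptr)
{
    memcpy(ptr, newVertexData, size);
    glUnmapBuffer(GL_ARRAY_BUFFER);
}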

 

Content issue #3: Not GPU limited at all

 

The final issue I will talk about today is one where the GPU is not the bottleneck at all, but which often shows up as poor graphics performance.

 

We can only maintain the pipeline of frames if the CPU can produce new frames faster than the GPU consuming them. If the CPU takes 20ms to produce a frame which the GPU takes 5ms to render, then the pipeline will run empty each frame. In the example below the GPU is going idle every frame, but the CPU is running all of the time, which implies that the CPU cannot keep up with the GPU.

P4-CPU-limited.png

"Hang on" I hear you say, "that says the CPU is only 25% loaded". Streamline shows the total capacity of the system as 100%, so if you have 4 CPU cores in your system with one thread maxing out a single processor then this will show up as 25% load. If you click on the arrow in the top right of the "CPU Activity" graph's title box it will expand giving you separate load graphics per CPU core in the system:

Training-Busy-CPU-Expanded.png

As predicted we have one core maxed at 100% load, so this thread is the bottleneck in our system which is limiting the overall performance. There can be many reasons for this, but in terms of the graphics behavior rather than application inefficiency, the main two are:

 

  • Excessive amounts of glDraw...() calls
  • Excessive amounts of dynamic data upload

 

Every draw call has a cost for the driver in terms of building control structures and submitting them to the GPU. The number of draw calls per frame should be minimized by batching together the drawing of objects with similar render state, although there is a balance to be struck between larger batches and efficient culling of things which are not visible. In terms of a target to aim for: most high-end 3D content on mobile today uses around 100 draw calls per render target, with many 2D games coming in at around 20-30.

 

In terms of dynamic data upload, be aware that every data buffer uploaded from client memory to the graphics server requires the driver to copy that data from a client buffer into a server buffer. If this is a new resource rather than a sub-buffer update, then the driver has to allocate the memory for the buffer too. The most common offender here is the use of client-side vertex attributes. Where possible use static Vertex Buffer Objects (VBOs), which are stored persistently in graphics memory, and use that buffer by reference in all subsequent rendering, as sketched below. This allows you to pay the upload cost once, and amortize that cost over many frames of rendering.
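A minimal sketch of that pattern (the Vertex structure, attribute location and data names are assumptions):

// One-off: upload the vertex data into a static VBO.
GLuint vbo;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, sizeof(vertices), vertices, GL_STATIC_DRAW);

// Every frame: no data is copied, the attribute pointer is just an offset into the VBO.
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glVertexAttribPointer(positionLoc, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                      (const void *)offsetof(Vertex, position));
glEnableVertexAttribArray(positionLoc);
glDrawArrays(GL_TRIANGLES, 0, vertexCount);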

 

In some cases it may not be the Mali graphics stack which is limiting the performance at all. We do sometimes get support cases where the application logic itself is taking more than 16.6ms, so the application could not hit 60 FPS even if the OpenGL ES calls were infinitely fast. DS-5 Streamline contains a very capable software profiler which can help you identify precisely where the bottlenecks are in your code, as well as helping you load-balance workloads across multiple CPU cores in your system if you want to parallelize your software using multiple threads, but as this is not directly related to Mali behavior I'm not going to dwell on it this time around.

 

Next Time ...

 

Next time I will be reviewing the Mali driver's approach to render target management, and how to structure your application's use of Frame Buffer Objects (FBOs) to play nicely with this model.

 

Comments and questions welcome,

Cheers,

Pete

 


Pete Harris is the lead performance engineer for the Mali OpenGL ES driver team at ARM. He enjoys spending his time working on a whiteboard and determining how to get the best out of combined hardware and software compute sub-systems. He spends his working days thinking about how to make the ARM Mali drivers even better.

ARM Mali GPU Compute for Computer Vision: IMVC 2014


April 1, 2014 was the date for the Israel Machine Vision Conference (IMVC) in Tel Aviv.  I’m always slightly wary of attending events held on April 1: I never know for sure that after all the queuing, checking in, travelling, waiting for bags and finding the hotel and the venue, someone won’t just say “April Fool!” when you get there.  Well, not to worry – IMVC was very real.  It’s the annual get-together for the prolific Israeli computer vision community and it packed in a day of fascinating talks about the latest developments in this exciting subject.  It was great to see so much innovation in evidence and to hear from companies large and small working in this area.


As with many areas of technology there was much talk about mobile.  Computer vision in mobile devices is a hot topic, particularly as the energy efficient yet powerful processors required are quickly coming of age.  Roberto Mijat and I were there to talk about ARM’s central role in this area and in particular the advantages of using the GPU to offload some of the processing required to enable these sorts of features.


mali_at_imvc.jpg
     In full flow talking about GPU Compute on Mali

 

Devices containing the ARM® Mali™-T600 series of GPUs have been providing general purpose compute capabilities for a couple of years now and there are many examples of the benefits of using the GPU for both graphics and non-graphics computation.  I showcased a few of these in my talk, including GPU-accelerated gesture recognition from Israeli company eyeSight® Technologies and face detection and analysis from PUX (Panasonic), both of which have been optimised to run on the Mali-T604 GPU using OpenCL™.  In these and many other cases we see the GPU making sufficient difference to enable computer vision algorithms to run in real time.  Better still, the GPU gives us additional compute bandwidth, which allows the use of more sophisticated algorithms that have been shown to enhance the user experience significantly – eyeSight’s low-light gesture detection is a great example.  Equally important is that we can do all this while burning much less energy, a crucial requirement for mobile devices.

 

 

eyeSight's gesture recognition in action (as shown at CES 2014)

Another area of discussion – both in my talk and elsewhere at the conference – was the different ways of achieving computer vision on mobile.  As well as using GPUs, ARM’s CPU technology already offers heterogeneous features through big.LITTLE™ and NEON™ technology, and there are custom DSPs designed for specific image processing jobs that can sit alongside the ARM CPU.  A DSP is hard to beat when it comes to area and power, but the downside is its lack of flexibility: as new algorithms come along you need new DSPs – and this is where the programmable GPU really scores.  It allows existing hardware to take on powerful new capabilities.

 

arm_booth.jpg

    Demonstrating PUX face detection and analysis demo at IMVC


We met with many interesting companies and discussed some compelling new computer vision use cases, boding well for what we’ll see emerging over the next few months.  The conference ended with some interesting stargazing from both Google and Microsoft, and then it was all over for another year.  Our host and conference organiser was SagivTech Ltd, a company dedicated to computer vision research, development and education.  Our thanks to them for inviting us along and for organising such a great event.

 

For more information, please visit:

 

http://www.imvc.co.il/

 

http://community.arm.com/groups/arm-mali-graphics/blog/2013/10/06/improved-gesture-detection-with-mali-gpu-compute

SagivTech's OpenCL Mobile Computing Benchmark Suite


The Mobile Computing Benchmark Suite

 

Mobile device manufacturers as well as consumers have shown growing interest in greater computing power on mobile devices. This processing power is necessary for gaming, augmented reality, image and video enhancement and other compute-intensive applications that run on mobile devices.

 

One way of increasing mobile device processing power is better exploitation of the massive computing power offered by on-board Graphics Processing Units (GPUs), which can be used to carry out compute-intensive tasks with relatively low power consumption.

 

Objective measurement of key performance indicators has always been crucial to understanding how a given platform performs. The best method of achieving these key performance indicators may vary from vendor to vendor, due to different hardware designs and software stack limitations. This complicates the process of evaluating different devices and deciding which platform is faster and by how much – and this complexity makes benchmarking all the more necessary.

 

For all these reasons, SagivTech created a Mobile Computing Benchmark Suite to establish reliable performance metrics for common mobile computing operations and the expected performance of GPU code on specific hardware. This post will take a closer look at two of the primary benchmarks we included in our suite – bandwidth and FLOPs – and show the sustained results for these two crucial performance aspects, measured on the ARM Mali-T628MP6 GPU found in the Samsung Galaxy Note 10.1 device.

 

Bandwidth Benchmarks

The bandwidth test measures how fast the input data (once in memory visible to the GPU) can reach the GPU’s ALUs for processing. This is usually measured in gigabytes of data per second. An algorithm is considered to be bandwidth-limited if most of its execution time is spent loading and storing data between memory and the ALUs. The faster the data can reach the ALUs, the faster the algorithm works, and the faster the user will see the output.

 

Currently, SagivTech’s Bandwidth Benchmark tests use only floating-point types, with the four sub-tests using float, float2, float4 and float8 variables. The test kernel code is very simple – it just copies one variable from the input buffer to the output buffer.
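For illustration only (this is my own sketch, not SagivTech's actual benchmark code), the float4 variant of such a copy kernel could look like this:

// Hypothetical float4 copy kernel of the kind a bandwidth test might use.
__kernel void copy_float4(__global const float4 *in, __global float4 *out)
{
    int tid = get_global_id(0);
    out[tid] = in[tid];   // one load and one store per work-item
}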

 

Figure 1 shows the measured GB/s on the ARM Mali-T628MP6 GPU.

ARM_FLOPS.jpg

 

FLOPs Benchmark

While a bandwidth-bound algorithm is limited by the time spent in the memory subsystem, a compute-bound algorithm is limited by the time the GPU spends doing arithmetic. Compute-bound performance is usually measured in FLOPs – how many floating point operations can be done in one second. For example, A = b + c would count as one FLOP, and A = b + c + d would count as two FLOPs. On most common GPU implementations, peak performance is calculated by counting an FMA operation, A = b + c * d, as two FLOPs, even though most platforms can calculate this in one clock.

For compute-bound algorithms running on the GPU, the higher the peak theoretical and sustained FLOPs the device can achieve, the faster the algorithm will run. As with the bandwidth case, achieving a high FLOPs rate in real-world applications can be very challenging, and the method of achieving it can vary significantly from platform to platform.


SagivTech’s FLOPs Benchmark measures the peak sustained FLOPs a compute-enabled GPU can achieve. To do this, SagivTech created a set of test scenarios to see how each platform behaves under different types of load and usage, and how much of the peak performance of the GPU each scenario actually achieves.

 

The table below details all the scenarios we tested in the FLOPs Benchmark, along with the parameters used to calculate how many FLOPs each test yields on the platform at hand. Of course, these tests are not real-world – they were defined to demonstrate how peak FLOPs are calculated and how well each scenario maps to the current platform.

 

 

 #   Operation    Inst #   Calculation                      Var type   FLOPs
 1   Simple add   1        val1 = value1 + val1             float      1 * 1 * 1 = 1
 2   Simple add   4        val1 = value1 + val1             float      4 * 1 * 1 = 4
                           val2 = value1 + val2
                           val3 = value1 + val3
                           val4 = value1 + val4
 3   MADD         1        val1 = value1 + val1 * value2    float      1 * 2 * 1 = 2
 4   MADD         4        val1 = value1 + val1 * value2    float      4 * 2 * 1 = 8
                           val2 = value1 + val2 * value2
                           val3 = value1 + val3 * value2
                           val4 = value1 + val4 * value2

 

Additional coding techniques were employed to make sure the test results are consistent, such as manually unrolling the loop inside the kernel and averaging kernel timings by running the kernel in a loop on the host.

 

Below is sample kernel code for scenario #4 in the above table, using float4 as the variable type:

FLOPs Test Kernel

__kernel void kernelFlopsTest(__global float4 *data, int iterations)
{
     int tid = get_global_id(0);
     float4 val1, val2, val3, val4, value1, value2;
     val1 = data[tid];
     value1.x = val1.w; value1.y = val1.z; value1.z = val1.y; value1.w = val1.x;
     value2.x = val1.w; value2.y = val1.w; value2.z = val1.y; value2.w = val1.y;
     val2.x = val1.w; val2.y = val1.z; val2.z = val1.y; val2.w = val1.x;
     val3 = value1; val4 = value2;
     for (int j = 0; j < iterations; ++j)
     {
         val1 = value1 + val1 * value2; val2 = value1 + val2 * value2;
         val3 = value1 + val3 * value2; val4 = value1 + val4 * value2;

         /* ... the above two lines are repeated a further 98 times ... */

         val1 = value1 + val1 * value2; val2 = value1 + val2 * value2;
         val3 = value1 + val3 * value2; val4 = value1 + val4 * value2;
     }
     data[tid] = (val1 + val2 + val3 + val4);
}
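Purely as an illustration of how such a run can be turned into a GFLOP/s figure (my own sketch, not SagivTech's harness – the queue, kernel, work-item count and iterations names are assumptions), the host side might use OpenCL event profiling like this:

/* Assumed: 'queue' was created with CL_QUEUE_PROFILING_ENABLE, 'kernel' is the
   FLOPs test kernel above, and 'iterations' matches the kernel argument. */
cl_event evt;
size_t globalSize = NUM_WORK_ITEMS;            /* assumed work-item count */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, &evt);
clWaitForEvents(1, &evt);

cl_ulong start, end;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,   sizeof(end),   &end,   NULL);
double seconds = (double)(end - start) * 1e-9; /* profiling values are in ns */

/* Each loop iteration executes the two MADD lines 100 times; each copy of the
   pair is 4 float4 MADDs = 4 vars * 4 lanes * 2 FLOPs = 32 FLOPs. */
double flops  = (double)globalSize * iterations * 100.0 * 32.0;
double gflops = flops / seconds / 1e9;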

 

Figure 2 shows the measured GFLOP/s on the ARM Mali-T628MP6 GPU.

ARM_BW.jpg


Wrapping Up

SagivTech created the Mobile Computing Benchmark Suite to make it easier to evaluate how well new hardware might perform when running heavy computing tasks. The first steps in this evaluation are to measure the GPU’s maximum sustained memory subsystem performance and its floating-point arithmetic capabilities. Comparing these sustained performance numbers to the theoretical peaks that the device is supposed to yield can help in identifying device strengths, capabilities and limitations.

 

To learn more about SagivTech’s Mobile Computing Benchmark Suite and the full results of our initial benchmarking, please download the Mobile Computing Benchmark Suite White Paper.

 

About SagivTech

 

SagivTech is a leading and veteran provider of innovative technology, solutions and services for GPU computing and computer vision, and a recognized source of expertise in image and signal processing algorithms and software development for parallel computing platforms. For more information, please visit www.sagivtech.com.

 

Disclaimer




Reference to any specific commercial product, process or service by its trade name, trademark, and manufacturer or otherwise, does not constitute or imply its endorsement or recommendation by us. Any trademark reference belongs to its owner and we make no claim as to its use or ownership and will use it only to truthfully and accurately identify such product or service.

 

WE  WILL NOT BE LIABLE TO YOU OR ANYONE ELSE FOR ANY LOSS OR DAMAGES OF ANY KIND (INCLUDING, WITHOUT LIMITATION, FOR ANY SPECIAL, DIRECT, INDIRECT, INCIDENTAL, EXEMPLARY, ECONOMIC, PUNITIVE, OR CONSEQUENTIAL DAMAGES) IN CONNECTION WITH THE DATA OR YOUR USE THEREOF OR RELIANCE THEREUPON, EVEN IF FORESEEABLE OR EVEN IF WE HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES (INCLUDING, WITHOUT LIMITATION, WHETHER CAUSED IN WHOLE OR IN PART BY NEGLIGENCE, GROSS NEGLIGENCE, OR OTHERWISE, BUT EXCLUDING WILLFUL MISCONDUCT) UNLESS SPECIFIED IN WRITING. OUR TOTAL AND AGGREGATE LIABILITY IN CONNECTION WITH THE DATA OR YOUR USE THEREOF OR RELIANCE THEREUPON WILL NOT EXCEED USD $100.

YOUR USE AND RELIANCE UPON THE DATA IS AT YOUR RISK. IF YOU ARE DISSATISFIED WITH THE DATA OR ANY OF THE INFORMATION, YOUR SOLE AND EXCLUSIVE REMEDY IS TO DISCONTINUE USE OF OR RELIANCE ON THE DATA.

YOU ACKNOWLEDGE AND AGREE THAT IF YOU INCUR ANY DAMAGES THAT ARISE OUT OF YOUR USE OF THE DATA OR RELIANCE THEREUPON, THE DAMAGES, IF ANY, ARE NOT IRREPARABLE AND ARE NOT SUFFICIENT TO ENTITLE YOU TO AN INJUNCTION OR OTHER EQUITABLE RELIEF RESTRICTING EXPLOITATION OF ANY DATA, PRODUCT, PROGRAM, OR OTHER CONTENT OWNED OR CONTROLLED BY US.

Get started with compute shaders


Last month I was at the Game Developers Conference (GDC), where I had a fabulous time attending various talks and roundtables and visiting exhibitors. I particularly enjoyed showing and explaining to people the latest technologies developed within ARM, such as ASTC 3D HDR textures and Transaction Elimination, as well as compute shaders.

 

With regards to the last one, many of you have been curious about how to get this piece of technology incorporated into your software. With that in mind, I decided to write this blog to help you write a simple program with compute shaders. I hope this blog will help you to create more advanced applications based on this technology.

 

So, what are compute shaders? Compute shaders introduce heterogeneous GPU Compute from within the OpenGL® ES API; the same API and shading language which are used for graphics rendering. Now that compute shaders have been introduced to the API, developers do not have to learn another API in order to make use of GPU Compute. The compute shader is just another type of shader in addition to the already broadly known vertex and fragment shaders.

 

Compute shaders give developers a lot of freedom to implement complex algorithms and make use of GPU parallel programming. Although the contemporary graphics pipeline is very flexible, developers still tend to stumble on some of its restrictions. With compute shaders, however, we no longer have to think in terms of the fixed pipeline stages we are used to with vertex and fragment shaders, and we are no longer restricted by the inputs and outputs of those stages. The Shader Storage Buffer Object (SSBO) feature, for instance, was introduced alongside compute shaders; it gives additional possibilities for exchanging data between pipeline stages, as well as acting as a flexible input and output for compute shaders.


Below you can find a simple example of how to implement compute shaders within your application. The example calculates a coloured circle with a given radius; the radius is a uniform parameter passed by the application and is updated every frame in order to animate the circle. The whole circle is drawn using points, which are stored as vertices within a Vertex Buffer Object (VBO). The VBO is bound as an SSBO (without any extra copy in memory) and written by the compute shader.


circle_01.png  circle_02.png


Let’s start by writing the OpenGL ES Shading Language (ESSL) compute shader code first:


#version 310 es

// The uniform parameter which is passed from the application every frame.
uniform float radius;

// Declare a custom data struct, which represents either a vertex position or a colour.
struct Vector3f
{
      float x;
      float y;
      float z;
      float w;
};

// Declare the custom data type, which represents one point of a circle:
// a vertex position and a colour respectively.
// As you may have already noticed, this defines interleaved data within
// the buffer, i.e. Vertex|Colour|Vertex|Colour|…
struct AttribData
{
      Vector3f v;
      Vector3f c;
};

// Declare the input/output buffer from/to which we will read/write data.
// In this particular shader we only write data into the buffer.
// If you do not want your data to be aligned by the compiler, try using
// packed or shared instead of the std140 keyword.
// We also bind the buffer to index 0. You need to set the buffer binding
// in the range [0..3] – this is the minimum range guaranteed by Khronos.
// Note that various platforms might support more indices than that.
layout(std140, binding = 0) buffer destBuffer
{
      AttribData data[];
} outBuffer;

// Declare the work group size. In our case it is 8x8, which gives
// a group size of 64.
layout (local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

// Declare the main program function, which is executed once
// glDispatchCompute is called from the application.
void main()
{
      // Read the current global position for this thread
      ivec2 storePos = ivec2(gl_GlobalInvocationID.xy);

      // Calculate the global number of threads (size) for this dispatch
      uint gWidth = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
      uint gHeight = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
      uint gSize = gWidth * gHeight;

      // Since we have a 1D array we need to calculate the offset.
      uint offset = storePos.y * gWidth + storePos.x;

      // Calculate an angle for the current thread
      float alpha = 2.0 * 3.14159265359 * (float(offset) / float(gSize));

      // Calculate the vertex position based on the angle calculated above
      // and the radius, which is given by the application
      outBuffer.data[offset].v.x = sin(alpha) * radius;
      outBuffer.data[offset].v.y = cos(alpha) * radius;
      outBuffer.data[offset].v.z = 0.0;
      outBuffer.data[offset].v.w = 1.0;

      // Assign a colour for the vertex
      outBuffer.data[offset].c.x = storePos.x / float(gWidth);
      outBuffer.data[offset].c.y = 0.0;
      outBuffer.data[offset].c.z = 1.0;
      outBuffer.data[offset].c.w = 1.0;
}



Once the compute shader code has been written, it is time to make it work in our application. Within the application you need to create a compute shader, which is just a new type of shader (GL_COMPUTE_SHADER), and the other calls related to the initialisation remain the same as for vertex and fragment shaders. See below for a snippet of code which creates the compute shader and also checks for both compilation and linking errors:


// Create the compute program, to which the compute shader will be assigned
gComputeProgram = glCreateProgram();

// Create and compile the compute shader
GLuint mComputeShader = glCreateShader(GL_COMPUTE_SHADER);
glShaderSource(mComputeShader, 1, computeShaderSrcCode, NULL);
glCompileShader(mComputeShader);

// Check if there were any issues when compiling the shader
int rvalue;
glGetShaderiv(mComputeShader, GL_COMPILE_STATUS, &rvalue);
if (!rvalue)
{
       glGetShaderInfoLog(mComputeShader, LOG_MAX, &length, log);
       printf("Error: Compiler log:\n%s\n", log);
       return false;
}

// Attach the compute shader to the program and link it
glAttachShader(gComputeProgram, mComputeShader);
glLinkProgram(gComputeProgram);

// Check if there were any issues when linking the program
glGetProgramiv(gComputeProgram, GL_LINK_STATUS, &rvalue);
if (!rvalue)
{
       glGetProgramInfoLog(gComputeProgram, LOG_MAX, &length, log);
       printf("Error: Linker log:\n%s\n", log);
       return false;
}


 

So far we have created the compute shader on the GPU. Now we need to set up the handles which will be used for the shader's inputs and outputs. In our case we need to retrieve the radius uniform location and set gIndexBufferBinding (an integer variable) to 0, as the binding was hardcoded in the shader with binding = 0. Using this index we will be able to bind the VBO to the SSBO binding point and write data from within the compute shader to the VBO:

 

// Use the compute program in order to read the radius uniform location.
glUseProgram(gComputeProgram);

// Retrieve the radius uniform location
iLocRadius = glGetUniformLocation(gComputeProgram, "radius");

// See the compute shader: “layout(std140, binding = 0) buffer destBuffer”
gIndexBufferBinding = 0;

 


Okay, so far so good. Now we are ready to kick off the compute shader and write data to the VBO. The snippet of code below shows how to bind the VBO to the SSBO and submit a compute job to the GPU:


// Bind the compute program
glUseProgram(gComputeProgram);

// Set the radius uniform
glUniform1f(iLocRadius, (float)frameNum);

// Bind the VBO to the SSBO binding point; it is going to be filled in
// within the compute shader.
// gIndexBufferBinding is equal to 0 (the same as the compute shader binding)
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, gIndexBufferBinding, gVBO);

// Submit a job for compute shader execution.
// GROUP_SIZE_HEIGHT = GROUP_SIZE_WIDTH = 8
// NUM_VERTS_H = NUM_VERTS_V = 16
// As a result the function is called with the following parameters:
// glDispatchCompute(2, 2, 1)
glDispatchCompute((NUM_VERTS_H % GROUP_SIZE_WIDTH  + NUM_VERTS_H) / GROUP_SIZE_WIDTH,
                  (NUM_VERTS_V % GROUP_SIZE_HEIGHT + NUM_VERTS_V) / GROUP_SIZE_HEIGHT,
                  1);

// Unbind the SSBO buffer.
// gIndexBufferBinding is equal to 0 (the same as the compute shader binding)
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, gIndexBufferBinding, 0);


 

As you may have already noticed, for the glDispatchCompute function we pass the number of work groups rather than the number of threads to be executed. In our case we dispatch 2x2x1 groups, which gives 4 groups. However, the real number of threads (invocations) executed will be 4 x (8 x 8), which results in 256 threads. The 8x8 figure comes from the compute shader source code, as we hardcoded those numbers within the shader.
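If the vertex counts were not exact multiples of the group size, a more general way to compute the group counts (a small sketch of my own, not part of the original sample) is the usual ceiling division:

// Number of work groups needed to cover 'count' items with 'groupSize'
// invocations per group (rounds up when count is not an exact multiple).
static GLuint groupsFor(GLuint count, GLuint groupSize)
{
    return (count + groupSize - 1) / groupSize;
}

// e.g. glDispatchCompute(groupsFor(NUM_VERTS_H, GROUP_SIZE_WIDTH),
//                        groupsFor(NUM_VERTS_V, GROUP_SIZE_HEIGHT), 1);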

 

So far we have written the compute shader source code, compiled and linked it, initialised the handles and dispatched the compute job. Now it’s time to render the results on screen. Before we do that, however, we need to remember that jobs are submitted and executed on the GPU in parallel, so we need to make sure the compute shader has finished before the draw command starts fetching data from the VBO it updates. In this example you won't see much difference in runtime with and without synchronisation, but once you implement more complex algorithms with more dependencies you will notice how important it is.

 

// Call this function before submitting a draw call that uses the
// buffer written by the compute shader.
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);

// Bind VBO
glBindBuffer(GL_ARRAY_BUFFER, gVBO);

// Bind the vertex and fragment rendering shaders
glUseProgram(gProgram);
glEnableVertexAttribArray(iLocPosition);
glEnableVertexAttribArray(iLocFillColor);

// Draw points from the VBO
glDrawArrays(GL_POINTS, 0, NUM_VERTS);
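The post does not show how the interleaved AttribData layout is described to the rendering program, so here is a hedged sketch of the attribute setup implied by the struct (two vec4s per vertex, giving a 32-byte stride); treat it as an assumption rather than the original sample's code:

// Interleaved layout from the compute shader: vec4 position, then vec4 colour.
const GLsizei stride = 8 * sizeof(GLfloat);        // sizeof(AttribData)
glVertexAttribPointer(iLocPosition,  4, GL_FLOAT, GL_FALSE, stride, (const void *)0);
glVertexAttribPointer(iLocFillColor, 4, GL_FLOAT, GL_FALSE, stride,
                      (const void *)(4 * sizeof(GLfloat)));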


 

In order to present the VBO results on screen you can use vertex and fragment programs, which are shown below.

 

Vertex shader:

 

attribute vec4 a_v4Position;
attribute vec4 a_v4FillColor;

varying vec4 v_v4FillColor;

void main()
{
      v_v4FillColor = a_v4FillColor;
      gl_Position = a_v4Position;
}

 


Fragment shader:

 

varying vec4 v_v4FillColor;

void main()
{
      gl_FragColor = v_v4FillColor;
}



I think that’s all for this blog, and hopefully I will be able to cover more technical details in the future. I believe you will find compute shaders friendly and easy to use in your work. I personally enjoyed implementing the Cloth Simulation demo, one of ARM’s latest technical demos, which was released at GDC. The important thing in my view is that once a developer is used to OpenGL ES, it is now easy to move on to GPU Compute using just one API. More than that, exchanging data between graphics and compute buffers is done in a clean and transparent way for developers. Don't limit your imagination to the example in this blog – it is only here to help you learn how to use compute shaders. I personally see real potential in image processing, as you can implement algorithms that execute on the chip using internal memory, which should reduce traffic on the bus between memory and chip.


You can also have a look at our latest Cloth Simulation demo, which has been implemented with compute shaders. See the video below:

 



ARM Mali Compute Architecture Fundamentals


In his book "How music works", David Byrne points out that music is created to fit a given context: music that would sound great in a symphony hall would likely sound unconvincing in a stadium.  Similarly, OpenCL programs are often created with a particular context in mind.  In particular, OpenCL kernels are often optimised for a particular compute device (e.g. a GPU in the programmer's desktop machine).

 

I am continuing my blog series by presenting the fundamentals of the ARM Midgard architecture underpinning the ARM® Mali-T600 and Mali-T700 GPU series.

 

Architecture Overview

The number of Mali cores in a system-on-chip (SoC) can be scaled to satisfy the performance requirements for this SoC.  For example, the Mali-T624 can be scaled from one to four cores. Each core contains a tri-pipe consisting of two arithmetic (A) pipelines, one load-store (LS) pipeline and one texturing (T) pipeline:

mali-tripipe.png

Thus, the peak throughput of each core is two A instruction words, one LS instruction word and one T instruction word per cycle.

 

Try as I might, I cannot assign the Midgard architecture to a single class:

  • Midgard is a Very Long Instruction Word (VLIW) architecture, such that each pipe contains multiple units and most instruction words contain instructions for multiple units.
  • Midgard is also a Single Instruction Multiple Data (SIMD) architecture, such that most instructions operate on multiple data elements packed in 128-bit vector registers.
  • Finally, Midgard is a Fine-Grain Multi-Threaded (FGMT) architecture, such that each core runs its threads in a round-robin fashion, switching on every cycle to the next ready-to-execute thread. Interestingly, each thread has its own program counter (unlike warp-based designs, where threads in a warp share the same program counter).

 

Guidelines for Optimising Compute Kernels

So what do the Midgard architectural features actually mean for optimising compute kernels? I recommend:

  • Having sufficient instruction level parallelism in kernel code to allow for dense packing of instructions into instruction words by the compiler. (This addresses the VLIW-ness of the architecture.)
  • Using vector operations in kernel code to allow for straightforward mapping to vector instructions by the compiler; a small sketch follows this list. (I will have much more to say on vectorisation later, as it's one of my favourite topics.)
  • Having a balance between A and LS instruction words. Without cache misses, a ratio of 2:1 of A-words to LS-words would be optimal; with cache misses, a higher ratio is desirable. For example, a kernel consisting of 15 A-words and 7 LS-words is still likely to be bound by the LS-pipe.
  • Using a sufficient number of concurrently executing (or active) threads per core to hide the execution latency of instructions (which is the depth of the corresponding pipeline). The maximum number of active threads I is determined by the number of registers R that the kernel code uses: I = 256 if 0 < R ≤ 4; I = 128 if 4 < R ≤ 8; I = 64 if 8 < R ≤ 16. For example, kernel A that uses 5 registers and kernel B that uses 8 registers can both be executed by running no more than 128 threads per core. This means that it may be preferable to split complex, register-heavy kernels into a number of simpler ones. (For the compiler folk among us, this also means that the backend may decide to spill a value to memory rather than use an extra register when its heuristics suggest that the number of registers likely required is approaching 4 or 8.)
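As a small illustration of the vectorisation point (my own sketch, not from the original post), here is the same multiply-add written with scalar and with float4 types:

// Scalar version: one element per work-item.
__kernel void saxpy_scalar(__global const float *x, __global float *y, const float a)
{
    int i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}

// Vectorised version: four elements per work-item, mapping naturally onto
// Midgard's 128-bit vector registers and SIMD arithmetic units.
__kernel void saxpy_float4(__global const float4 *x, __global float4 *y, const float a)
{
    int i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}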

 

In some respects, writing high performance code for the Mali GPUs embedded in SoCs is easier than for GPUs found in desktop machines:

  • The global and local OpenCL address spaces get mapped to the same physical memory (the system RAM), backed by caches that are transparent to the programmer. This often removes the need for explicit data copying and the associated barrier synchronisation, as the sketch after this list illustrates.
  • Since all threads have individual program counters, branch divergence is less of an issue than for warp-based architectures.
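A minimal sketch of the first point (my own illustration, with the real work elided): a kernel written for desktop GPUs often stages data into __local memory before using it, whereas on Mali the simpler direct-access form is usually the better starting point.

// Desktop-style pattern: stage data into local memory, then work on it.
__kernel void process_staged(__global const float *in, __global float *out,
                             __local float *tile)
{
    int lid = get_local_id(0);
    int gid = get_global_id(0);
    tile[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);
    out[gid] = tile[lid];        // ... real processing would go here ...
}

// On Mali, local and global memory are the same physical RAM, so the extra
// copy and barrier add cost without adding bandwidth; reading the global
// data directly is usually preferable.
__kernel void process_direct(__global const float *in, __global float *out)
{
    int gid = get_global_id(0);
    out[gid] = in[gid];          // ... real processing would go here ...
}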

 

Fasten your Seat Belt!

With this theoretical knowledge under the belt, we can now look at optimising some kernels! Can you guess where we'll start?.. (Or use comments to vote for your favourite kernel!)
