Channel: ARM Mali Graphics

Nothing but Static


At GDC 2013 I gave a presentation called Nothing but Static on the ARM booth. Attendance was disappointing, partly because I showed up late, but also because people were understandably underwhelmed by the promise of a tech talk on making better use of static geometry in graphical applications. I shot myself in the foot with what I thought was a very clever title.

 

The real aim of the talk was to show how static geometry and textures can be used to create dynamic effects, reducing the bandwidth needed to animate a living environment. This matters because bandwidth is a constant worry in mobile graphics. A colleague of mine, Ed Plowman, regularly points out that if you map the raw compute power of GPUs over time, the ones in the mobile space scale along the same curve that console and desktop GPUs followed before them. If you chart available bandwidth over time, however, the curves diverge, as desktop GPUs developed into power-sucking monstrosities with their own arrays of fans and heat pipes. Bandwidth is almost entirely linked to power, and elaborate cooling systems are a place mobile devices can't afford to go.

 

With some older engines and applications, if geometry was animated, the animation had to be done on the CPU and the entire mesh re-sent to the GPU every frame. This was because the early mobile GPUs only supported OpenGL® ES 1.1, and with a fixed-function pipeline there was no way to manipulate the mesh after it had been sent. When OpenGL ES 2.0 was launched, it opened up a second possibility: skeletal animation on the GPU. You add an extra attribute to each vertex stating which bone it is aligned to, and then send a number of transformations in an array, so the GPU can look up the relevant transformation for each vertex based on its bone ID.
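As a minimal sketch of that idea (GLSL ES; the names and the 10-bone array size are illustrative, and a production skinned mesh would usually blend more than one bone per vertex):

```glsl
// Vertex shader: GPU skinning with one bone per vertex.
attribute vec4  a_position;
attribute float a_boneId;          // which bone this vertex is aligned to

uniform mat4 u_bones[10];          // the only per-frame data: one transform per bone
uniform mat4 u_viewProjection;

void main()
{
    // Look up the relevant transformation for this vertex by its bone ID.
    mat4 bone = u_bones[int(a_boneId)];
    gl_Position = u_viewProjection * (bone * a_position);
}
```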

 

This way the vertices can be stored in a vertex buffer object (VBO) and never change on the GPU; the only information sent per frame is the uniform array of bone transformations. This is certainly more efficient than animating on the CPU, but it still has drawbacks in some cases. Imagine an example mesh of a human figure: with the legs, torso, arms and head all moving independently, that's a total of 10 bones. Since you'll be doing nothing strange in the projection part of the matrix you can get away with a 4x3 matrix per bone, but even that means sending 120 floating point values per frame for a single model. If you need the model animated in a very specific way, such as character animation, this is still the most efficient method. Other types of animation needn't accept such restrictions.

 

In the recent Seemore demo we showed a greenhouse with a monstrous plant growing from the centre of the floor, and dotted around it were writhing tentacles like vines, also bursting through the ground. If we’d wanted to animate each of them with its own skeleton we’d have had to carefully trade off the number of bones against the resolution of the movement. Getting a fluid rippling motion over a skeleton is difficult, probably because all the things in real life that move that way (like snakes and cephalopods) have either a huge number of bones or no bones at all.

 

So, for the tentacle we applied a sine wave, with the phase based on the model-space Y coordinate, shifting the tentacle's vertices in X and Z. There's a little more to it than a single lateral shift: there was also the matter of setting the surface normals and tangents based on the cosines of the same equation, as well as tilting the mesh around its central axis so it looked like it was really curling around, not just being skewed. The effect was applied along with a few other curvature and pulse equations, all of which had their magnitude increased by distance from the ground, so the base never moved away from the hole in the floor we'd made for it. The total per-frame data was a single vec4 for each tentacle, with X and Y making the tentacle lean in a given direction (there had originally been plans to have them attempt to touch the player, but it was already freaky enough with them just wiggling) and Z and W controlling the phase of the wriggling and the pulse.
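A hedged sketch of the core of that vertex shader (the constants and names are mine; the real demo layered several curvature and pulse terms and also rebuilt normals and tangents from the cosines of the same equations):

```glsl
attribute vec4 a_position;   // model space; Y runs up the tentacle
uniform mat4 u_mvp;
uniform vec4 u_anim;         // x,y: lean direction, z: wriggle phase, w: pulse phase

void main()
{
    vec4 p = a_position;
    // Displacement magnitude grows with height, so the base never
    // leaves the hole in the floor.
    float mag = 0.05 * p.y * p.y;
    // Lateral sine wave with its phase running along model-space Y.
    p.x += mag * (sin(2.0 * p.y + u_anim.z) + u_anim.x);
    p.z += mag * (cos(2.0 * p.y + u_anim.z) + u_anim.y);
    // A simple radial pulse driven by the same per-frame vec4, again
    // fading to nothing at the base.
    p.xz *= 1.0 + 0.1 * clamp(p.y, 0.0, 1.0) * sin(p.y - u_anim.w);
    gl_Position = u_mvp * p;
}
```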

A similar compound curve was applied to the tongue and the stem of the plant to give it a nice S-bend with a pair of circular curves of slowly shifting tightness.

 

Normally when I present this information, this is the point at which someone points out that not everything can just wriggle: real-world applications rely on quite rigid constraints on the animation. That is why I quickly move on to a second example: the pages of a book in the Gesture demo we produced much earlier. The interface was intended to feel as tangible as possible, so interactive elements had to move correctly. One such element was a big thick book whose pages users could open and flip through.

 

A lot of virtual books have the pages as a pair of solid blocks which hinge in the middle, and I can only surmise from this that a lot of graphics coders have never really looked at a book. A block of pages in a book will never open to a flat surface, particularly if the book has a wide, stiff spine. What actually happens is that the spine bends some of the way and the rest of the bend comes from the pages, which curve out from the middle a short way and then lie flat, ending in a chiselled face at the page edges. Getting the book to open in this way would be highly impractical with skeletal animation but, provided the mesh has sufficient vertex resolution, the page curvature can easily be calculated algorithmically by rotating the points around a cylindrical section, the outer radius of which becomes the offset for the chisel on the end. A single uniform float controlled the curvature of the pages in this way, from fully closed to fully open.
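Here is a hedged sketch of that cylindrical mapping (not the demo's actual code; the radius range and the choice of axes are my assumptions):

```glsl
attribute vec4 a_position;   // x: distance from the spine, y: along the spine
uniform mat4  u_mvp;
uniform float u_open;        // the single uniform: 0 = fully closed .. 1 = fully open

void main()
{
    // The cylindrical section loosens as the book opens.
    float radius = mix(0.1, 2.0, u_open);
    // Wrap at most a quarter turn around the cylinder...
    float angle = min(a_position.x / radius, 1.5707963);
    // ...and whatever length is left over lies flat along the tangent.
    float spare = a_position.x - angle * radius;
    float x    = radius * sin(angle) + spare * cos(angle);
    float lift = radius * (1.0 - cos(angle)) + spare * sin(angle);
    gl_Position = u_mvp * vec4(x, a_position.y, a_position.z + lift, 1.0);
}
```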

 

A similar equation was then used to animate a single page surface, rising up from this tightly curved rest position into a more relaxed curve, which then inverted as the page turned over, finally landing flat in the rest position on the opposite side. The fact that it started and ended flush with the pages allowed the textures to be parameterised: during the animation one half of the book showed the next page, the other half the previous one, and the page turning between them had a page on either side and was only rendered during the page-turn animation, giving the perfect illusion of pages turning one at a time from a solid block.

 

On ARM® Mali™ hardware supporting OpenGL ES 3.0 the vertex shader can access texture sampling functions, which means that algorithmic animation of vertices can use textures as input to give less purely mathematical results, such as ocean waves or deformable terrain.

 

Algorithmic animation is not limited to the vertex shader. There's no end to the number of weird and wonderful effects possible with a little ingenuity in the fragment shader. Combinations of sine waves with a time-controlled phase value can represent anything from a rippling pond to an electric plasma arc. My personal favourite effect in this vein is the dust particle animation from the Timbuktu demo.

 

For this effect each particle was a camera-aligned billboard with texture coordinates ranging from (2, 2) to (-2, -2). This curious set-up meant we could do a quick r = x² + y² and figure out whether the fragment was inside a unit circle, discarding any that weren't. Following on with sqrt(1 - r), we get the third dimension of the surface of a unit sphere inside that circle's screen space. Since it's a unit sphere, that position is also the surface normal at that point on the sphere; converted to world space, we could then run lighting equations to shade it as a sphere and use the dot product with the camera vector to fade it at the edges, like a perfectly round cloud in the middle of the billboard.

 

That effect by itself is no big deal, but the magic happens when you add a noisy texture into the mix. Sample the texture based on the original X and Y, plus a time-based offset value, and use the resulting red and green to offset X and Y before performing the sphere calculation. What this gives you is a noisy cloud which seems to flow over time. Since this is a particle, it also fades over time, and during that fade the distortion is increased, making the cloud expand, billow and dissipate in a noisy, organic-looking way.
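Putting the last two paragraphs together, a sketch of that fragment shader might look like this (GLSL ES; the uniform names, the noise scales and the light vector being supplied in billboard space are all my assumptions):

```glsl
precision mediump float;

varying vec2 v_uv;            // interpolates from (-2,-2) to (2,2) across the quad
uniform sampler2D u_noise;    // tiling noise texture
uniform float u_time;         // scrolls the noise so the cloud appears to flow
uniform float u_age;          // particle age: 0 = fresh .. 1 = fully faded
uniform vec3  u_lightDir;     // normalized light direction in billboard space

void main()
{
    // Distort the sample point with noise; distortion grows as the
    // particle ages, so the cloud billows and dissipates.
    vec2 n = texture2D(u_noise, 0.25 * v_uv + vec2(u_time, 0.0)).rg - 0.5;
    vec2 p = v_uv + n * (1.0 + 4.0 * u_age);

    float r = dot(p, p);      // r = x^2 + y^2
    if (r > 1.0) discard;     // outside the unit circle: not on the sphere

    // Third coordinate of the unit sphere; on a unit sphere the position
    // is also the surface normal, so it can be lit directly.
    vec3 normal = vec3(p, sqrt(1.0 - r));
    float diffuse  = max(dot(normal, u_lightDir), 0.0);
    float edgeFade = normal.z;          // dot with the view axis fades the rim
    gl_FragColor = vec4(vec3(diffuse), edgeFade * (1.0 - u_age));
}
```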

 

The great thing about this technique is that by changing the texture, the cloud looks different. The same algorithm can produce tight noisy clouds of dust or soft billowy clouds of steam. It can even distort more in a specific direction to look like a thin wispy vapour.

 

All these techniques are described in the GDC talk I gave, which was later combined with my second presentation from the same event, about draw-call batching, into a far more attractively titled video called “Dynamic Duo”. For reasons unfathomable to me it is most often referred to as “Optimised Effects for Mobile”. You can find it on www.malideveloper.arm.com, or you can watch it right now:

 

 

If you’d like to talk about any of the techniques I’ve described in person, I regularly attend game development events and I’m not hard to find. Keep an eye on the ARMMultimedia twitter feed to see what events we’re attending next. Alternatively, drop a comment in the section below.


Game Set and Batch


At the Develop 2012 conference in Brighton I gave a talk about how we achieved some of the effects in our brand new (at the time) demo Timbuktu. As I repeated this presentation at a number of developer events, one particular section of it got longer and longer as I incorporated additional information to pre-answer the most common questions I was receiving. When the opportunity arose to write new presentations, expanding that one section into a presentation by itself seemed an easy win.

 

While we’re talking about easy wins, have you ever found that as you develop an application with lots of models you reach a point where, regardless of the complexity of the models, each new model you add drops the frame rate? There’s a chance that you’ve hit the draw call limit, which coincidentally is what that presentation I mentioned before was about.

 

There's a limiting factor in graphics which has nothing to do with the GPU itself and everything to do with the CPU load associated with sending commands to the driver. This load is generated by calls to glDrawElements and glDrawArrays, often referred to by the collective name ‘draw calls’. Everything up to a draw call simply sets state in the driver software. At the point the draw call is issued, all that state gets bundled up and sent to the GPU in a language it can understand, so that the GPU can then work on rendering it all without any further communication with the driver.

 

The exact figure changes depending on the CPU you're using, but as a rule we try to stay under 50 draw calls per frame in our internal demos, fewer if possible, and we maintain this limit despite having a complex virtual world by the use of batching.

Batching is a technique whereby you draw multiple things in one draw call. The simplest way to imagine it is that you take a number of different models and put them all in the same vertex buffer object, then render the whole buffer as one. If the objects have different textures, these are combined into one big texture atlas and the texture coordinates rescaled to look up the correct points in the atlas rather than in the individual textures. Finally, to make sure the objects can move independently, the vertices get an extra attribute: an ID number tagged to each vertex to say which model it's part of.

 

In the vertex shader you then supply an array of uniform mat4 values, rather than the single world-space transformation typically used, and the ID number indexes into this array to find the right one. Thus you can have different models with different textures in different positions, with different scale and rotation factors, all moving independently within a single draw call.
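A minimal sketch of that lookup (GLSL ES; the array size and names are illustrative):

```glsl
attribute vec4  a_position;
attribute vec2  a_uv;           // already remapped into the texture atlas
attribute float a_modelId;      // which object in the batch this vertex belongs to

uniform mat4 u_world[24];       // one world transform per batched object
uniform mat4 u_viewProjection;

varying vec2 v_uv;

void main()
{
    // Each vertex selects its own world transform by ID. Passing a zero
    // matrix for an entry collapses that object to a degenerate point,
    // which is the cheap skipping trick described further down.
    gl_Position = u_viewProjection * (u_world[int(a_modelId)] * a_position);
    v_uv = a_uv;
}
```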

 

If you do this with different models it's a way of batching together a scene, though note that the objects will always be drawn in the order in which they are laid out in the VBO, which makes it a little harder to depth sort the scene. If the models are identical you can draw them in the right order, because it doesn't matter which model ID represents which particular instance of that model.

 

Using a batch like this to represent multiple instances of the same object also offers an additional technique with pretty much no overhead. Fill a VBO with the same object at different levels of detail, starting with the most detailed and ending with the least, and the detail level will switch automatically, so long as you draw your instances front to back.

 

When batching different objects in a scene, the issue of occlusion, or of removing objects from the scene, sometimes comes up. Models at the start of the batch can be skipped by starting at a later vertex, and reducing the vertex count will stop the draw before the end; but if you are drawing a batch of models and want to skip a few in the middle, the quick way to take them out is to pass a matrix of zeroes into that part of the uniform array, essentially scaling them down to a completely degenerate point at the world-space origin. However, if you have a sparsely rendered batch of objects (basically, if between the first and last models you render there are more models skipped with a zero matrix than actually rasterized to the screen), it may work out more efficient to render it in more than one draw call. If you do a lot of batching and the application is constantly vertex bound irrespective of how much is currently drawn, this might be a sign that you're transforming lots of batched vertices by null matrices.

If you've been proactive in your batching you should be sufficiently under the CPU load limit to draw a VBO in several passes, using different starting vertices and different vertex counts to draw subsets of the buffer. Exactly how you slice it depends on your application, but by examining the CPU and vertex shader load in ARM® Streamline™ Performance Analyzer you should be able to make the right choices.

 

The final question which usually arises is how to perfectly depth sort different objects within a batch, for example if the objects are alpha blended and need to be sorted back to front. There's no perfect solution for this, although depending on your use case there are a number of partial solutions. If you're working with a small number of objects, you could store an index buffer containing the objects in every possible permutation and pass the right ordering through to the draw call. Faced with a larger number of objects, I'd suggest splitting the alpha-blended geometry out into separate, topologically identical meshes. Alpha-blended models are often mostly opaque with one specific part that is blended, such as a model of a tree with a few textured leafy parts or a car with transparent windows. If the transparent parts are simple enough, they can be made topologically congruent, using parameters to change what each mesh represents on the fly.

 

A good example of this is merging different types of foliage into a batch. In Timbuktu we did this first by making the opaque parts, tree trunks and the like, into a separate geometry batch. Then the grass, shrubs, treetops and bushes could all be represented by a mesh which looked like a couple of crossed rectangles, textured, rotated and scaled based upon what the mesh was meant to be. The texture bounds within the texture atlas were passed as an array, just like the matrices, allowing the models to be re-ordered freely and still represent different things in world space.

 

All these techniques are described in a presentation I gave on the ARM booth at GDC 2013, which later got combined with my other presentation from that event and recorded for the Mali developer website. You can watch the video right now:

 

 

If you’d like to talk about any of the techniques I’ve described in person, I regularly attend game development events and I’m not hard to find. Keep an eye on the ARMMultimedia twitter feed to see what events we’re attending next. Alternatively, drop a comment in the section below.

ASTC does it


In 2011 I attended the ARM Global Engineering Conference, where I saw a presentation about a new algorithm used in texture compression. I expected it to be about colour space conversions and perceptual filters, but the entire talk was about number encoding, and how non-power-of-two number ranges could be stored more efficiently in trinary and quinary digits (trits and quints respectively) rather than binary bits.

 

Although the talk was very interesting, I came away a little disappointed, feeling like I had learned something very simple when I expected something complicated. But since then I've had to explain it to several people and realised it's actually not as simple as it seemed; Tom Olson had just explained it very well. For a while it still seemed kind of useless, though.

 

Turning the clock forward again, I was recently tasked with writing up an introductory document for the new Adaptive Scalable Texture Compression (ASTC) algorithm. The document served two purposes, firstly as a user guide for the compressor on our developer resources site, and secondly as an explanation of how the compression algorithm itself works.

 

I think it's important for people to know how ASTC works, because only then can they confirm that it actually does work. If you instead just say “This is ASTC, it can compress an RGBA image to less than one bit per pixel”, most people will instinctively call shenanigans at the very thought of it. It barely seems possible to get it down to 1 bit per pixel; when you start quoting figures like 0.89, people's brains go out of the window. It's lossy compression, so you don't have to burn the entire history of mathematics and data representation just yet, but the popular consensus is that it's still better than pretty much anything else out there.

 

Low bit rates in texture compression are achieved by compressing pixels in blocks. Typically the blocks are completely self-contained, so you don't need any other data to decompress them: no lookup tables or dictionaries. Taking ETC as an example, 4 bits per pixel are achieved by compressing a 4x4 block into a 64-bit word using a couple of low bit rate base colours and even lower bit rate per-texel offsets. This fixed data rate means a texel coordinate can quickly be converted into a block position, and therefore into a data offset to decompress. The 4 bits per pixel is fixed for ETC, though, as is the maximum quality of the algorithm.
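As a worked illustration of that fixed-rate addressing (the row-major block layout is my assumption, not a claim about any particular implementation):

$$\frac{64\ \text{bits}}{4 \times 4\ \text{texels}} = 4\ \text{bits/texel}, \qquad \text{offset}(x, y) = \left( \left\lfloor \tfrac{y}{4} \right\rfloor \cdot \tfrac{W}{4} + \left\lfloor \tfrac{x}{4} \right\rfloor \right) \times 8\ \text{bytes}$$

where $W$ is the texture width in texels.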

 

ASTC, by comparison, allows you to trade off quality against bit rate. As the name suggests, it is also scalable, so you can increase the bit rate to improve quality, anywhere from 0.89 up to 8 bits per pixel. Actually it can go as low as 0.59 if you're using 3D textures, but let's not get ahead of ourselves.

 

The way this is achieved is by using variable block sizes. Not variable in data size: every block is 128 bits regardless of the bit rate. The footprint of the blocks varies from 4x4 to 12x12 (with a couple of fruity rectangles in the mix too, giving you even more bit-rate choice), so the bit rate is 128 bits divided by the number of pixels in the block: 16 pixels in 128 bits is 8 bits per pixel, and 144 pixels in 128 bits is 0.89 bits per pixel. ASTC is also unique in that it can compress textures in 3D blocks, which go from 3x3x3 to 6x6x6. That's 216 values in 128 bits, or 0.59 bits per pixel.
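Summarised as arithmetic, those rates are:

$$\frac{128}{4 \times 4} = 8.0, \qquad \frac{128}{12 \times 12} \approx 0.89, \qquad \frac{128}{6 \times 6 \times 6} \approx 0.59 \ \text{bits per pixel}$$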

 

I guess I've not really explained the ‘how’ part yet, though. Every block has sets of endpoint colours. So if you had a block that consisted entirely of different shades of blue, your endpoints would be a light and a dark blue. If the block was something more interesting, like flames, the endpoints would be more like yellow and red or, in a small block, a yellowish orange and a reddish orange; after all, there's no point in the range containing colours that aren't in the block. The individual samples are then an offset between these two values, so the block can contain all kinds of different gradient patterns. Since you're probably aiming for less than one bit per pixel, you obviously can't sample every pixel exactly, so it samples intermittently and interpolates between the samples.

 

That probably sounds like it might lead to blurry looking images with blocky colour leaking, but that's only half the story, because the block can have up to 4 different colour gradients, each with a start and end point. There's a single set of samples saying where in its gradient each texel lies, but which gradient a specific texel is taken from comes from a second layer: an index pattern which tells it, per texel, which gradient to pick from. The great thing is, this set of indices isn't interpolated, so it gives nice crisp edges on contrasting colours between the soft parts of the gradients.
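In equation form, each texel's colour is simply an interpolation between the endpoints of whichever gradient its index selects (a simplification: the real decoder works in quantised integer space):

$$C(t) = \left(1 - w_t\right) C_0^{P(t)} + w_t\, C_1^{P(t)}$$

where $w_t$ is the texel's interpolated weight and $P(t)$ is the partition its index pattern assigns to it.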

 

The obvious follow-on question is: how do you fit any kind of per-texel value into a texture if you don't even have a bit per texel? The answer is hash patterns. I'm going to drop my favourite picture here.

[partitions.png: a grid of example ASTC partition patterns]

 

ASTC generates these little colour partition blocks from a 10-bit seed. You'll notice there's a mix of two-partition, three-partition and four-partition blocks. There's also a one-partition block, but it's not as interesting.

 

The partition count is quite important; the fewer partitions a block has, the more data is available to fine-tune the endpoints and the gradient values in between. So while more partitions might give crisper details, the softer parts between those details will suffer slightly.

 

All this talk of tradeoffs is actually a strong hint at how compression works. Decompression happens in a single fast pass, often in the graphics hardware itself, so you don't even notice it. The gradient offset is pulled from the sample map, the hash is generated to look up the gradient endpoints, and the final colour is calculated from them. Compression, on the other hand, has to take a block and figure out how many partitions to use, which hash pattern to pick for them, what the endpoint colours should be for each one and, for those awkward times when the partitions don't match the colours perfectly, where exactly a blue pixel lies in a gradient between red and yellow.

 

This is why compression takes so long. After all, the most important thing is how quickly a texture decompresses during rendering. Compression actually leverages the speed of decompression to its advantage: the algorithm tests out partitions, picks endpoints from the colours in the block, fills the samples with values calculated against the corresponding gradients, and then decompresses the result into a block of pixels. It then uses error metrics to give a single value for how wrong this attempt was. It does this lots and lots of times, always holding onto the least wrong result.

 

The block is considered ‘done’ either after a certain number of attempts or when the error value falls below a given threshold. I like to imagine that the algorithm either gets fed up and takes the best of a bad bunch or finds a decent one and says “close enough”.

 

The compression time can be varied by telling it how close is close enough and how many attempts to make before it gives up. Some images may come out pretty well with a lower compression time, as the early attempts fit well; others may look awful after the first few attempts and need longer. User control of these factors means you get to make that call.

 

Back to the 128 bits: I know that some readers probably still don't quite believe me. You know that 10 bits are used for the hash pattern, and in the worst case you need 4 sets of colour endpoints. These are RGBA, so 4 values each for 8 colours, plus a grid of 16 interpolated samples: 48 values in the 118 remaining bits. That means you only have about 2.46 bits for each value. But you can't have 2.46 bits, you can only have 2 bits, so you have to quantise oddly so that some values get 3 bits and some get 2; otherwise you're wasting 22 whole bits, and you can't just waste bits if you're going for a low bit rate. Then the algorithm needs to know which 22 values are 3 bits and which 26 are 2 bits. Any way you slice it, the maths ends up strange. Since the number of partitions changes the number of endpoints, it would also need different known data sizes for different types and different weights, and it all gets horribly complicated.
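Spelling that worst case out with the numbers from the paragraph above (a simplification: real ASTC blocks also spend a few bits on mode fields):

$$128 - \underbrace{10}_{\text{hash seed}} = 118\ \text{bits}, \qquad \underbrace{4 \times 2 \times 4}_{\text{RGBA endpoints}} + \underbrace{16}_{\text{weights}} = 48\ \text{values}$$

$$\frac{118}{48} \approx 2.46\ \text{bits per value}, \qquad 22 \times 3 + 26 \times 2 = 118
$$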

 

Except you don't have to do that, because in 2011 I attended a talk at the ARM Global Engineering Conference, where Tom Olson explained a new technique in texture compression. By representing values as trinary and quinary digits, they can be packed more densely. If you're working in base 5, each digit runs from 0 to 4: the first quint counts 1s, the second counts 5s, and the third counts 25s. In three such digits you have a numerical range from 0 to 124, and 124 can be represented in 7 bits.

 

If you wanted to hold a value from 0 to 4 in binary, you'd need 3 bits, so three such values would take 9 bits. But 9 bits can hold values from 0 to 511, and it's really hard to move 9 bits around anyway, so in practice you'd use 16 bits. 16 bits can hold five 3-bit values, with one unused bit at the end. Use that same space for quints and it will hold six of them, with 2 unused bits at the end. Similar results are seen with trits.
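The packing arithmetic behind that claim:

$$5^3 = 125 \le 2^7 = 128 \;\Rightarrow\; 3\ \text{quints fit in 7 bits} \approx 2.33\ \text{bits each (vs. 3 bits naively)}$$

$$3^5 = 243 \le 2^8 = 256 \;\Rightarrow\; 5\ \text{trits fit in 8 bits} = 1.6\ \text{bits each (vs. 2 bits naively)}$$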

 

It's not just the uniform representation of non-power-of-two value ranges that makes this so interesting. The choice of base 3 and base 5 is quite clever for numerical reasons. If you normalise the value and make it a representation of a gradient between 0 and 1, the values of a quint represent ( 0, ¼, ½, ¾, 1 ) and a trit represents ( 0, ½, 1 ), so it even makes the blending faster by using power-of-two divisors.

 

When I mentioned blocks of 3 bits, some of you might have said, “Yes, but those hold 8 values”. What eight normalized values would they have held? They would have held sevenths. Have you ever tried to get a computer to work with sevenths?

 

I didn’t think so.

 

If you'd like to know more about how all this fits together and try out ASTC for yourself, there are a number of documents available on the Mali developer website alongside the evaluation codec, and you can also find our OpenGL® ES emulator, which supports ASTC.

 

If you’d like to talk about any of this, my colleagues and I regularly attend game development events where we’re not hard to find. Keep an eye on the ARMMultimedia twitter feed to see what events we’re attending next. Alternatively, start the conversation in the comments section below.

New samples in Mali SDK


I often receive questions such as “How can I render shadows with OpenGL® ES 2.0 without an available depth texture extension?”; “How can I render simple text with OpenGL ES 2.0?”; or “Simple text rendering does not produce a high quality result – how can I improve this?”.

 

With these questions in mind I decided to put together sample code and whitepapers for this particular developer audience, while being aware that there are much more sophisticated approaches which other developers will be working on. These tutorials present basic approaches for getting something working quickly on a device. I encourage you to use any sample(s) you are interested in as part of your project. Below are short descriptions of what I have just released.


Shadow Mapping

This tutorial presents an approach to shadow rendering without depth textures, which are not available in OpenGL ES 2.0 unless the OES_depth_texture extension is present. The sample is based on a projective texture mapping technique. It does as many rendering passes as there are lights, plus one final pass to draw the objects with the shadows on top of them.

Please see the "Shadow Mapping - realtime shadow rendering with OpenGL ES 2.0" whitepaper in order to familiarize yourself with the approach.


Simple Text Rendering

This sample presents one of the simplest approaches to dynamic 2D textured text rendering in 3D space. When I say “dynamic”, I mean text which may change from frame to frame, while the application continues to run in real time.

In the tutorial you will find more information on how to improve text quality by implementing commonly used OpenGL features. Please see the sample code and the "Simple text rendering - improving quality and performance" whitepaper for more details.


High Quality Text Rendering

This is another approach to rendering textured text. Compared to the method above, this approach focuses on achieving the best possible quality and rendering performance, but you might lose performance or see some delay when changing (building) text objects. With that in mind, you should understand that this approach is not suitable for text which changes often – one change every several frames should be fine.

For this approach you need a proper font engine, as you no longer want to rely on a fixed-size font from a texture atlas (as described in the “simple text rendering” sample). The font engine produces texture data with the whole text rendered into it; the texture is then presented on a simple quad.

If you want to find out more, please read both the "High quality text rendering - improving quality for textured text" whitepaper and the source code.


Fur

This sample demonstrates how to create a fur effect in real time. The technique does not require any advanced GPU features; the example is based on OpenGL ES 2.0, but could be implemented even with the very first version of the API.

The idea is based on a semi-transparent object being rendered several times. Each time the (convex) object is rendered, its scale is slightly increased and its alpha decreased.
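A hedged sketch of one such shell pass (GLSL ES; the names and the 5% inflation step are my own illustrative choices):

```glsl
// One fur "shell": the mesh is drawn repeatedly with u_shell stepping
// from 0.0 (base pass) to 1.0 (outermost pass).
attribute vec4 a_position;
uniform mat4  u_mvp;
uniform float u_shell;       // 0.0 = innermost shell .. 1.0 = outermost

varying float v_alpha;       // the fragment stage multiplies this into the colour

void main()
{
    // Scale the convex object up slightly more on each pass...
    vec3 inflated = a_position.xyz * (1.0 + 0.05 * u_shell);
    // ...and make the outer shells progressively more transparent.
    v_alpha = 1.0 - u_shell;
    gl_Position = u_mvp * vec4(inflated, 1.0);
}
```

Drawing the mesh a dozen or so times with increasing u_shell builds up the volume; for non-convex meshes the same trick is usually done by inflating along the vertex normals instead of scaling.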

If you want to find out more about this technique I encourage you to get familiar with this source code and the "Fur - realtime rendering technique using OpenGL ES 2.0" whitepaper.

EGLImage - updating a texture without copying memory under Android


As many developers struggle to use EGLImage in the Native Development Kit (NDK) under Android™, I wanted to help a bit. Before I dive into the details I would like to explain briefly what the EGLImage extension is. EGLImage is an EGL extension to be used when texture content is going to be updated very often – more or less every frame. As you might already know, the glTexImage and glTexSubImage functions are not suitable for this kind of operation, because data is copied/converted in the drivers from CPU to GPU memory in order to stay compliant with the Khronos standard. The EGLImage extension makes use of contemporary mobile chipsets where the CPU and GPU share the same physical memory. As long as they share the same memory, copying data may not be required in certain cases, and this is exactly where EGLImage comes in. Once the content of the EGLImage has been updated on the CPU, it has an immediate effect on the texture being rendered on screen.

 

With that in mind I decided to write a simple example of how to use the extension, as this is not an easy task, especially for people who are not very familiar with Android, and especially because a custom build of Android is usually required. But do not be afraid if you do not want to build Android on your own: I have implemented a simple library called gbuffer, a wrapper on top of the Android interface, and using it you will not have to build Android. Once you have developed your application with the library, it should run on any other non-modified Android platform that supports the EGLImage extension. The library itself does not contain anything specific to the ARM® Mali™ GPU architecture.

 

Please see the example source code and the EGLImage - updating a texture without copying memory under Android whitepaper which explains the above in more detail and has a step by step guide to what you should do in order to get EGLImage up and running in your application.

Saving System Power with ARM Multimedia IP


ARM’s Multimedia IP portfolio is designed to work together to reduce overall system power while delivering the performance that is central to the mobile device experience.

 

What is the Multimedia Experience?

Most of the interactions that users have with modern tablets and smartphones count as a multimedia experience, integrating sound, vision and interaction for every task. From our point of view, the most complex part of this experience is vision - pushing pixels.

 

Device resolutions are growing fast. Already, smartphones commonly run at 1080p and tablets sport 2560x1600 display panels. There is no sign of this trend slowing down any time soon. At the same time, the display refresh is expected to be smooth, with 60 frames per second now seen as a minimum target rather than a luxury.

 

The amount of computing power required to calculate each pixel is also going up. Operating systems usually allocate a separate frame buffer for each application on the screen, and then compose these outputs onto the final display. The applications themselves use more sophisticated shaders to represent lighting and surface detail, and this applies to UIs as well as games.

 

All of this requires more computation, more memory bandwidth and, unless we are careful, a lot more power.

 

Multimedia System Components

The multimedia system consists of a number of components which each serve a different function. Each task is handled by a hardware block which has inputs, intermediate data and outputs, all of which contribute to the total power budget for that task.

 

Here are some use cases which will hopefully illustrate what I mean and introduce each of ARM’s IP blocks:

 

| Use case | Hardware | Input data | Intermediate data | Output data |
| --- | --- | --- | --- | --- |
| Video encoding | Camera, ARM® Mali™-V500 VPU | Uncompressed frames | Reference frames | Video stream |
| Video decoding | Mali-V500 VPU | Video stream | Reference frames | Uncompressed frames |
| UI | Mali-T628 GPU | Geometry, Textures | — | Rendered images |
| Gaming | Mali-T628 GPU | Geometry, Textures | G-buffers (render-to-texture) | Rendered images |


Standalone Optimization

The most obvious place to start is to optimize each IP block on its own. The ARM Mali GPU team is architecting the Midgard series of GPUs and video processors to be best-in-class in terms of efficiency.

 

Working closely with the semiconductor foundries, we have also developed the ARM POP™ IP for Mali GPUs, which increases the performance per watt of our GPUs using a number of targeted low-level optimizations. We have customized the cell library and layout rules to best match the characteristics of the manufacturing process. We also created compound cells such as multi-bit flip-flops and custom memory layouts that best serve the unique requirements of GPU data.

 

We also include additional tools for defining and implementing power gating rules, so that existing clock-gating strategies can be extended to reduce static as well as dynamic power.

 

Bandwidth, Bandwidth, Bandwidth

One of the most power hungry parts of the system is the memory. As the number and complexity of pixels increases, so does the memory bandwidth requirement.

 

Memory technology is getting better and lowering power for each access, but the increase in bandwidth requirement overwhelms that trend. Here are some approximate values for just the DRAM chip itself:

 

[Memory Bandwidth Costs.png: approximate DRAM power cost per unit of bandwidth]

These values are for 2 channel LPDDR, averaged from various online sources. Add in the memory controller power and interconnect, and the problem only looks worse.

 

Another problem apart from the demand for raw bandwidth is that the “Random” in “Dynamic Random Access Memory” isn’t that random any more. With the RAM core speed not increasing, the interfaces are serializing wider and wider internal access widths onto the bus. If you want 1 byte, you still get its 63 neighbours along for the ride. ARM IP is designed to ensure good locality of access so that those additional bytes are likely to contain data which will be required soon. The job of the cache on a mobile GPU is as much to control memory bandwidth as to increase performance.

 

It’s good to get the best bandwidth for your data, of course, but the data that take the least bandwidth are the data that you never read (or write).

 

Texture Compression

In some games, 90% of the memory read bandwidth can be texture accesses. Anyone who has read this blog for a while will know that I am about to sing the praises of ASTC again. And I will, but only in a very quick summary. If you want more details, see my previous blog.

 

ARM’s ASTC, or Adaptive Scalable Texture Compression, is a new texture compression format which offers better quality at lower bit rate than all of the current low dynamic range compression schemes available today, and matches the performance of the de facto high dynamic range codec. By allowing content developers to more finely tune their texture compression, ASTC will reduce the bandwidth required for textures further.

 

And now, for the first time, ASTC-capable hardware is in the hands of consumers thanks to the ARM Mali-T628 based Samsung Galaxy Note 3. For the consumer, the inclusion of ASTC means that applications which make use of it will have visibly better texture quality and smaller texture size, often at the same time. Smaller texture sizes result in faster downloads and, most importantly of all, lower power consumption as the GPU requires fewer memory accesses to display them.

 

Transaction Elimination

All the ARM Mali GPUs are tile-based renderers, coloring the pixels for a single tile of the screen in a small internal memory before writing it out to main memory. However, if the pixels in the tile have not changed since the last time it was written, there is no need to write it again. We can eliminate the memory transaction.

 

This reduces the bandwidth required to write the frame buffer, and with resolutions going up all the time, the frame buffer bandwidth is considerable. For the larger tablets, a 2560x1600 pixel display at 24 bits/pixel and 60 frames per second update requires a whopping 750MB/s just to write.
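That headline figure is straightforward arithmetic:

$$2560 \times 1600\ \text{pixels} \times 3\ \tfrac{\text{bytes}}{\text{pixel}} \times 60\ \tfrac{\text{frames}}{\text{s}} = 737{,}280{,}000\ \tfrac{\text{bytes}}{\text{s}} \approx 750\ \text{MB/s}$$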

 

Transaction elimination helps to reduce that by between 30 and 80 percent, especially in crucial long-running use cases like UI, web browsing, and Angry Birds.

 

Frame Buffer Compression

How can we reduce frame buffer bandwidth further? An obvious idea is to compress it, but it’s not so obvious how. We need to preserve quality, so it should be lossless. It needs to be fast and cheap to compress and decompress. For video and GPU use cases, it also needs fast random access.

 

With the ARM Mali-V500 video processor and future High-End Midgard GPUs, we have included support for ARM Frame Buffer Compression, or AFBC for short.

 

This is the secret of the ARM Mali-V500’s astonishingly low bandwidth figures. By compressing the intermediate reference frames used by the video codec, the bandwidth drops dramatically. Typical Blu-Ray content can be compressed by 40%, and this saving is multiplied with every read and write. For details, see my colleague Ola’s blog Mali-V500 video processor: reducing memory bandwidth with AFBC.

 

Future High-End Midgard GPUs will support AFBC for input, so they can directly use compressed video input, and also for output. This supports the popular technique of G-buffering, where intermediate rendering results are rendered out by the GPU and reused as textures in a final pass. This can be used to reduce computation but at the expense of bandwidth. By using AFBC, the bandwidth is reduced and the applicability of the technique widens.

 

Tying the System Together

These bandwidth reduction techniques can be applied to single cores, but the full potential is only realizable using a fully joined-up approach.

 

With all the IP blocks in the system supporting these technologies, we can achieve significant end-to-end savings in bandwidth and power.

 

And that will help to ensure that your next smartphone or tablet not only looks cool, but is cool too.

Mali Developer Resources


If you are enjoying all the information available here in the ARM Connected Community, then we have another site you should check out as well, dedicated to supporting people who develop on Mali GPUs. The Mali Developer Centre hosts a range of free, tried and tested developer resources which will enable you to bring your visual computing projects to life more easily.

 

Below are some quick links to help you find what you need:

 

Asset Creation Tools

 

  1. Mali GPU Texture Compression Tool
  2. ASTC Evaluation Codec
  3. Mali GPU Asset Conditioning Tool
  4. Mali GPU Binary Asset Exporter

 

Performance Analysis Tools

 

  1. Mali Graphics Debugger
  2. ARM DS-5™ Streamline
  3. Mali GPU Offline Shader Compiler

 

Software Development Tools

 

 

  1. OpenGL ES 3.0 Emulator
  2. OpenGL ES 2.0 Emulator
  3. Mali GPU Shader Development Studio
  4. Mali GPU Shader Library
  5. Mali GPU User Interface Engine

 

SDKs

 

  1. Mali OpenGL ES SDK for Linux
  2. Mali OpenGL ES SDK for Android
  3. Mali OpenCL SDK

 

Drivers

 

  1. Mali T6xx GPU User Space Drivers
  2. Mali GPU Device Driver Model
  3. Open Source Mali-200/300/400/450 GPU Kernel Device Drivers
  4. Open Source Mali-T6xx GPU Kernel Device Drivers
  5. Open Source Mali GPUs Linux EXA/DRI2 and X11 Display Drivers
  6. Open Source Mali GPUs UMP User Space Drivers Source Code
  7. Open Source Mali GPUs Android Gralloc Module

 

 

You should also check out the sample code, developer guides and development platform information available on the site.

 

If you have a question relating to any of the tools, please use the “Ask a Question” feature and one of the team will get back to you shortly.

Introducing the ARM Mali-T700 GPU series: Innovated to (efficiently) power the next generation of devices


Better performance or longer battery life? Or both?

 

Many mobile phone users have experienced the disappointment of having their favourite device run out of juice when they are far away from a power socket. Some may also have experienced the warm but uneasy feeling of their device overheating while they are hooked on an astonishingly addictive new 3D game. In the engineering world, battery life and thermal constraints are known to be the two facets of the most significant challenge for the mobile computing industry: designing high-performing processors that can withstand the increasing demand for rich visual experiences on a limited power budget.

 

This week at TechCon we are excited to introduce a new generation of ARM® Mali™ GPUs. The ARM Mali-T760 GPU and the ARM Mali-T720 GPU extend the previously accepted boundaries of energy efficiency and have been optimized to address the divergent requirements of the high performance and low cost market segments respectively.

 

As the market evolves, so too do our GPUs

 

The mobile computing industry has been booming over the last few years, with smartphone volumes increasing significantly and all indications pointing towards the acceleration of this trend in the future. Smartphone shipments are expected to grow by a massive 5x factor from 2010 to 2015. Emerging markets, some with massive populations, are key contributors to this growth.

However, due to financial constraints on a large part of the population in these countries, the entry-level sector benefits the most and will account for almost half of overall smartphone volumes. These entry-level devices need to provide a similar user experience to their higher-end counterparts, but with manufacturing costs well below $150 to ensure a viable profit margin. In order to achieve that, manufacturers optimize costs at every step of the manufacturing process. Time to market is also critical for this segment, because prices drop dramatically when a competitor brings a similar solution to market, squeezing the already low profit margins.

 

At the other end of the smartphone spectrum, superphone development is driven by the demand for ever higher performance within a static mobile power budget. High-end tablets have already broken the barrier of full HD 1080p screen resolution and are pulling the pixel density race to a whole new level, setting 4K2K (UHD) resolution as the new ultimate target. At the same time, High Dynamic Range applications demand 10-12 bits of precision per colour plane, contributing to the ever increasing amount of data that a graphics processor needs to handle. New memory technologies like LPDDR4 have been deployed to sustain this growing need for bandwidth. However, the power consumed by higher bandwidths has not been an easy problem to resolve and becomes the limiting factor for high-end mobile devices.

 

The new ARM Mali GPUs address the requirements of these two different market segments by introducing a variety of new features that redefine what is technically possible.

 

ARM Mali-T760 GPU for performance-optimized devices

 

 

The ARM Mali-T760 GPU is designed with a focus on high performance at the same time as high energy efficiency. It reaches a 400% improvement in these metrics over previous generations of ARM Mali GPUs.


It supports all the new graphics and GPU Compute programming interfaces (APIs), such as Direct3D® 11.1 feature level 11, OpenGL® ES 3.0*, and OpenCL™ 1.2, and so guarantees compliance with the latest and greatest graphics and compute content.

 

A significant achievement of the ARM Mali-T760 GPU is the new L2 cache interconnect that provides a cache coherent view of every L2 cache instance for every shader core and makes sure that memory bandwidth is evenly distributed among them. It supports extended scalability of up to 16 shader cores with linear performance improvement which allows the highest levels of performance without compromising on area efficiency. From a physical implementation perspective, it reduces the wire count between the L2 cache and shader cores and so enables easy timing closure, high layout utilization and low pin congestion.

 

Smart Mali Technologies reduce bandwidth

 

Smart Composition is a new technology introduced for the first time in the ARM Mali-T760 GPU. It has been developed to reduce the bandwidth consumed reading textures during frame composition, and can reduce standard Android™ UI texture read bandwidth by more than 50%. By analyzing frames prior to final frame composition, Smart Composition determines whether a given part of the frame needs to be rendered, or whether the previously rendered and composited part can be reused. If that portion of the frame can be reused, it is not read from memory again or recomposited, saving additional computational effort. In addition, the ARM Mali-T760 GPU supports ARM Frame Buffer Compression (AFBC), the lossless compression capability implemented to optimize bandwidth usage further, as well as Transaction Elimination and Adaptive Scalable Texture Compression.

Other features of the ARM Mali-T760 GPU include YCrCb frame buffer output and hardware assisted global illumination. Both are designed to increase fidelity and balance memory bandwidth to the system.

 

 

ARM Mali-T720 GPU – best for cost-optimization

 

The ARM Mali-T720 GPU is designed for performance density and ease of implementation in order to address the cost and time to market challenges of the entry-level smartphone segment. It achieves more than a 150% improvement in energy efficiency and graphics performance over previous generations of cost-optimized ARM Mali GPU solutions.

 

The ARM Mali-T720 GPU is based on the Midgard architecture, which enables it to benefit from the latest API support, plus bandwidth optimization features such as ASTC textures and Transaction Elimination. However, it has gone through extensive micro-architecture modifications that boost its performance efficiency close to the unparalleled density achieved by the previous-generation Utgard architecture. More specifically, it brings industry-leading OpenGL ES 3.0 support* to the mid-range mobile segment and is tuned to provide excellent graphics and compute support for the Android operating system, including support for RenderScript and FilterScript. Additionally, if Linux is the chosen operating system, OpenCL can be used alongside OpenGL ES 3.0.

 

To dramatically reduce implementation effort and enable fast time to market, the ARM Mali-T720 GPU has been optimized for a reduced number of routing layers while increasing layout utilization and easing timing closure. In addition, ARM POP™ IP and hard macro implementations will be available from ARM's Processor IP Division to guarantee best-in-class power, performance and area results with minimum implementation effort.

 

Does all this sound exciting? Let me know what you think of our new GPUs in the comments section below.

 

 

 

 

*Product is based on a published Khronos Specification, and is expected to pass the Khronos Conformance Testing Process. Current conformance status can be found at www.khronos.org/conformance


GPU Compute for Mobile Devices at ARM Techcon Developer Summit


timhar01 and I have just finished presenting a workshop about GPU Compute on mobile devices at the ARM Techcon Developer Summit in Santa Clara, California.

 

The workshop was 3 hours long and was a rapid tour of the current GPU Compute landscape with a focus on mobile and ARM® Mali™ GPUs in particular.

 

Although fairly mature on the desktop, GPU Compute is relatively new to the mobile space. The Mali-T600 series was the first line of ARM GPUs to introduce GPU Compute. The presentation started by giving an overview of the current landscape of GPU Compute on mobile, some of the use cases and the available APIs.

 

The presentation then focused on the details of two of the APIs that the Mali-T600 series supports. We looked at OpenCL™ and RenderScript APIs in particular; how they work and how to use them. These APIs are hardware abstraction layers: they are generic for all hardware; however there are some implementation-specific parts, especially when it comes to optimisation. Because of this I first presented the generic version and then moved on to how the APIs map to the underlying Mali hardware.

 

After we had established the basics of OpenCL and the Mali hardware Tim took over to get into the details of optimisation. He presented some top tips for writing high performance OpenCL code for Mali GPUs. To sum up all of these techniques and to go through an OpenCL optimisation process, he then went through a small case study. We started with a naïve version of the Laplace image filter and went through various iterations applying some of the tips and techniques presented earlier while looking at the performance numbers.

 

You can take a look at the slides which are available here GPU Compute for Mobile Devices at Techcon (the two blank slides are videos which are here: Mali OpenCL Flag Demo and here: Mali OpenCL Face Detection Demo). If you have any questions for me, feel free to ask them in the comments section below.

Optimizing GPU Compute Kernels


If you have some previous experience with GPU compute, or if you have watched the GPU Compute for Mobile Devices at ARM Techcon Developer Summit presentation, and you have a compute application that you want to optimize, it may be hard to know where to start. You may have been given some general advice, but it can be difficult to tell which kinds of optimizations are relevant for your particular kernels.

 

At the ARM Techcon Developer Summit, I talked about that problem, trying to give an intuition about how threads whirl around inside the cores while executing your kernels. As always, a prerequisite to successful optimization is obtaining some understanding of where the bottlenecks might be. For Mali, the first part of this presentation aims at giving that understanding. Armed with an understanding of how execution happens, the hardware counters in the GPU give the necessary capability of looking inside the cores to see what is actually going on while your program is running. Streamline gives a nice time-line view of many kinds of counters, and the second part of this presentation introduces them and their use for optimizing Compute kernels.

 

If you have any questions, this website is the place to ask.

 

Have fun!

Mali OpenCL Flag Demo


This is a demo created internally at ARM by Anthony Barbier.


The demo shows the performance improvements you can achieve when using OpenCL™ on a Mali powered device.


The application simulates a cloth flag with a ~6000 vertex model. Every frame, for each of these vertices, the application calculates the effect of gravity, wind and the spring forces between the vertices.

The demo is shown running on the Samsung Exynos 5250 Arndale Board from InSignal which has a dual core ARM® Cortex®-A15 CPU and a quad core ARM Mali™-T604 GPU.

 

Performance

The version shown first is written in multithreaded C running on the CPU (without using ARM Neon™ technology). This uses 100% of both cores of the dual-core Cortex-A15 CPU but only achieves around 4-5 fps. You can see that visually this is not a nice result: the scene is too slow, and the movement of the cloth is therefore not smooth. The GPU is underutilised in this version (less than 1% utilisation) – a system resource which could be put to good use.

 

Next, the OpenCL version is shown running on a Mali-T604 GPU. In this version, we render two flags (~12000 vertices) at around 36 fps. The flag looks much better now, and the intended simulation effect is much more obvious. The CPU usage in this version has fallen to single digits allowing it to be used for other tasks, for more features, or to sleep to reduce power usage. This shows a 16x performance improvement over the CPU version of the code (2x the number of vertices, 8x the frames per second).

 

This goes to show that for parallel applications such as this, OpenCL on a Mali device can provide superior performance. Each data point in this application can be calculated independently of all others and therefore, because the Mali GPU is very good at doing parallel processing (up to 256 hardware threads per core), it can easily outperform the CPU which is designed more for good sequential performance (one hardware thread per core).

 

OpenGL® ES and OpenCL Interoperability

The other interesting thing shown in this demo is efficient OpenGL ES and OpenCL interoperability. In the application OpenCL is used to manipulate the flag model data and then OpenGL ES is used to render it to the screen. Typically, the model data would be manipulated on the host (CPU) side of the application and then uploaded to the GPU for OpenGL ES to render. The host would upload the data into a VBO (Vertex Buffer Object) so the GPU has access to it. In a naïve system, you can imagine that in this demo you would have to (every frame):

  1. manipulate the data using OpenCL
  2. map the memory to a CPU pointer on the host side
  3. upload the data to a VBO for it to be rendered.

Thankfully, this is not the case as this would increase memory usage (increasing power usage) and reduce performance by needlessly copying memory. Instead the two APIs can share the same piece of memory directly.

 

Hopefully we will have an example of this in one of our Mali SDKs soon.

What is behind Mali's current momentum?


Firstly, a little bit of history for those new to the Mali world: ARM® moved into the GPU space in 2006 after acquiring the Norwegian company Falanx Microsystems and their Mali GPU architecture.  ARM had recognized the increasing importance of the graphics market for mobile, automotive and home applications due to the surge in the number of devices with graphics capabilities. ARM aimed to build upon their existing graphics activity and develop integrated solutions for SoCs in multimedia-rich embedded applications, complemented by a collaborative ecosystem of developers.

 

Now, in 2013, we can see the vast inroads that ARM has made. Across the entire GPU market for personal mobile devices, the ARM Mali GPU share stood at 18% in 1H13[1], an increase of 37% from 1H12. Furthermore, if we analyze the data for the burgeoning Android™ market, we see that ARM Mali GPUs are found in over 70% of Digital Smart TVs, over 50% of Android tablets and over 20% of Android smartphones. ARM Mali is, in fact, the number one GPU IP in Android and the most widely licensed GPU of all, with over 85 ARM Mali licenses to date. Shipments have grown more than tenfold in the past two years, and 2013 looks set to ship double the volume of GPUs of 2012 (>300 million units).

 

So why are companies choosing ARM Mali IP?

 

ARM's rapid growth in the GPU market can be put down to several key factors:


1. Lowering System Power – ARM develops energy-efficient IP. To this end, ARM Mali engineers are constantly innovating to reduce the power consumption of our GPUs, delivering IP that enables OEMs to extend the battery life of their mobile and consumer devices. A combination of tile-based and immediate-mode rendering, integrated L2 caches with unified memory access, internal clock gating, multiple levels of job management, GPU Compute functionality and bandwidth-reducing technologies all contribute to making ARM Mali-based SoCs more energy efficient.

 

2. Coherent ARM-based SoC solutions – With expertise across every element of an SoC, ARM can offer semiconductor companies IP for every individual component, enabling them to create a complete and holistic SoC solution, including technology (such as ARM POP™ IP) that helps them get to market faster with superior performance at a lower cost. ARM multimedia IP is intrinsically designed to be simple to implement alongside the ubiquitous ARM Cortex® CPUs, natively supports 64-bit and is compatible with the latest ARMv8 IP.

 

3. Leading on GPU Compute – GPU Compute solutions enable compute-intensive tasks within an application to be offloaded onto the GPU, the processor that is far more efficient at processing massive data-parallel workloads. This enables superior graphics performance and extended battery life, is rapidly becoming the norm for mobile devices, and opens up an entire new market for graphics-rich, innovative user experiences. The ARM Mali Midgard architecture is designed to integrate graphics and compute functionality, optimizing interoperation between the two and delivering market-leading 3D graphics and general-purpose parallel computation. It was the first GPU architecture to bring Full Profile GPU Compute to mobile devices.

 

4. Reducing complexity – The ARM Mali GPU architecture has been developed to reduce complexity. A single driver stack for all multicore configurations of a GPU simplifies application porting, system integration and maintenance. The provision of an industry-standard AMBA® AXI interface makes integration of an ARM Mali GPU into system-on-chip designs straightforward, and also provides a well-defined interface for connecting to other bus architectures. ARM POP IP is also available for certain ARM Mali GPUs to further decrease time to market and improve the dependability of your product. In addition, multicore scheduling and performance scaling are fully handled within the graphics system, with no special considerations required from the application developer.

 

5. Enabling end-to-end customer solutions – For end products such as smartphones, high-end tablets and DTVs to succeed, OEMs need more than just an SoC. ARM is at the centre of a global network of more than 1,000 silicon and software companies which together create a complete solution, from design to manufacture and end use, for products based on the ARM architecture. ARM offers a variety of resources to ecosystem members, including promotional programs, social media and industry networking opportunities that enable ARM Partners to come together to provide end-to-end customer solutions. The ARM Mali ecosystem is still young but growing rapidly, with well over 125 public partners already included. Their areas of business range from Services and Standards through to Computational Photography and Computer Vision.

 

[Image: the ARM Mali ecosystem]

 

6. Support at every step of the way – ARM offers a detailed support program to help customers succeed. For example, the ARM Services Division offers ongoing support, training and documentation to ARM licensees, and the ARM Mali Developer Centre offers a wealth of tools, drivers and SDKs to help developers working with ARM Mali GPUs get the highest possible performance out of their applications.

 

7. A GPU for every occasion – ARM currently offers eleven separate graphics processor licenses and one video processor, each scalable in core count, up to a maximum of sixteen cores with the ARM Mali-T760. This offers 72 different GPU implementation choices, and that is before the generational advancements to shader cores, memory systems and software drivers are taken into account, or the semiconductor companies' post-licensing customizations of the GPUs. This broad IP range enables semiconductor companies and OEMs to offer a diverse array of differentiated end products, all from one GPU architecture.

 

 

 

What are your thoughts on the subject? Let us know in the comments below.

 


[1]“Qualcomm Single Largest Proprietary GPU Supplier, Imagination Technologies the Leader in GPU IP, ARM and Vivante Growing Rapidly, According to Latest Report From Jon Peddie Research” http://jonpeddie.com/press-releases/details/qualcomm-single-largest-proprietary-gpu-supplier-imagination-technologies-t

New Mali-based Android Mini-PC on the market


This week a new Android (4.2) Mini PC was launched with the ARM® Mali™-400 GPU at its heart – the Rikomagic MK902.

 

[Image: the Rikomagic MK902]

 

Based on the Rockchip RK3188 processor, this new device is larger than its predecessors but offers more than a traditional low-price Google console. Beyond the 600,000 games on the Google Play store, easily accessible and playable when the device is paired with a controller, the Rikomagic MK902 boasts a 5MP webcam, enabling you to hold high-quality video conferences and Skype calls from your TV. Also on offer are 4 USB ports, a micro-SD slot, HDMI output and an Ethernet jack, all for a very affordable £89.

 

The Rockchip RK3188 is made up of a quad-core 1.6GHz ARM Cortex®-A9 CPU and the popular quad-core 600MHz ARM Mali-400 GPU for 2D and 3D graphics acceleration.

 

The official product site is available here.

 

And you can also check out some of the latest reviews of the device here:

 

Rikomagic MK902 Android Mini PC Launches (Geeky Gadgets)

Rikomagic MK902: Super charged Android PC even packs a webcam (The Gadget Show)

Intex to launch smartphone with Mediatek's octa-core MT6592


This week MediaTek launched the MT6592, a heterogeneous computing SoC which combines an advanced eight-core ARM® Cortex®-A7 processor configuration with industry-leading multimedia capabilities and mobile connectivity.

Built on the advanced 28nm HPM high-performance process, the MT6592 has eight CPU cores, each capable of clock speeds of up to 2GHz. The architecture is fully scalable, and the MT6592 enables both low-power and more demanding tasks to run effectively by using the eight cores in any combination. An advanced MediaTek scheduling algorithm also monitors temperature and power consumption to ensure optimum performance at all times.

 

On the multimedia side, the MT6592 SoC uses the quad-core ARM Mali™-450 GPU, offering great performance density with area optimization, low latency and energy efficiency. The ARM Mali-450 MP GPU is an enhanced version of the hugely successful ARM Mali-400 MP GPU, bringing top-of-the-range 2D and 3D graphics acceleration to consumer electronics. Linear performance scaling and a focus on maximizing processing efficiency, even when rendering at four times Full Scene Anti-Aliasing (4xFSAA), ensure superior image quality and provide a perfect solution for 2K and 4K screen resolutions. Thanks to an integrated configurable L2 cache with shared memory access, multiple levels of power management and a combination of immediate-mode and tile-based rendering, it is also able to keep memory bandwidth under control and lower overall system power consumption. The ARM Mali-450 MP GPU provides double the fill rate and geometry throughput of the ARM Mali-400 MP GPU, whilst being a fast, low-risk route to market thanks to the maturity of the underlying design and compatibility with the same DDK used by the ARM Mali-400 MP GPU.

 

Intex has now announced that it will be the first to launch an Indian smartphone with the new octa-core MediaTek chipset, offering high performance for multitasking at an affordable price point for consumers. In the press release, Sanjay Kumar Kalirona, Business Head-Mobile at Intex Technologies, said that their "quest to become the top smartphone player in the industry has made us bring extraordinary products with such revolutionary technology in the hands of consumers".


For more on the release, see these articles:

MediaTek - MediaTek Launches MT6592 True Octa-Core Mobile Platform

Intex Technologies to launch smartphone with MediaTek octa-core chip - Financial Express

Intex becomes first Indian company to launch 'Octacore' smartphone - Economic Times

Intex to launch a 6-inch HD smartphone with 1.7 GHz Octa-Core MediaTek MT6592 processor

New Mali Graphics Debugger v1.2 Features Improve Fragment Performance


Analysing Performance

When we talk about optimizing performance, we could mean a number of different things. The first thought that comes to most people's minds in the context of a gaming application is frame rate (FPS). FPS is very important; after all, a jumpy or sluggish application will disappoint your end user. However, at ARM we also like to include battery performance within this: as good as the FPS might be, there is little point if the end user is unable to play the game for a reasonable amount of time.

 

The other scenario is that you are completely satisfied with the FPS you are achieving on your chosen device. However, the vast array of mid-range devices on the market offers a rich, often untapped opportunity: with a little optimization, many of them are more than capable of running your content, allowing you to monetize it further. It's always worth spending some time on optimization, and with the great tools that ARM provides, a lot of the guesswork and pain can be avoided.

 

ARM® DS-5™ Streamline™ has been available for some time, allowing developers to optimize applications across the whole of any system with an ARM Cortex-A CPU and ARM Mali GPU. Using the tool, you can improve not only FPS but also power efficiency.

 

Understanding When You Are Fragment Bound

Using DS-5 Streamline, you can quickly identify where in the system the bottleneck lies. In a graphics application there are a number of areas where it could sit: the CPU, Vertex Processing, Fragment Processing and Bandwidth. However, to understand the cause of the bottleneck from an application perspective, you need the visibility and features provided by the Mali Graphics Debugger.

 

For the latter two, Fragment Processing and Bandwidth, the latest Mali Graphics Debugger v1.2 includes a number of new features to address the most common issues.

 

Overdraw

Overdraw is the term used when the same pixel is written more than once in a frame. This often occurs when you have transparency in a scene or when you draw your objects from back to front. Overdraw is not inherently a bad thing; you might need it to achieve a certain effect. It is unnecessary overdraw that we want to avoid. For example, unnecessary overdraw occurs if you draw with transparency when it is not required or, as mentioned above, draw your objects from back to front, causing the fragment processing on the GPU to work on pixels that may never be seen, reducing performance and burning precious joules.
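As an illustration of removing the avoidable cases, a common approach is to draw opaque geometry front to back with depth testing enabled, and to enable blending only for the geometry that genuinely needs it. The C sketch below uses standard OpenGL ES 2.0 calls; the sort and draw helpers are hypothetical placeholders for an engine's own scene code:

<code>
#include <GLES2/gl2.h>

/* Sketch of an overdraw-friendly draw order; sort_* and draw_* are
 * hypothetical placeholders for an engine's own scene code. */
void draw_scene(void)
{
    /* Opaque pass, front to back: the depth test can then reject
     * fragments that would be hidden anyway, so they are never shaded. */
    glDisable(GL_BLEND);
    glEnable(GL_DEPTH_TEST);
    glDepthMask(GL_TRUE);
    sort_opaque_front_to_back();       /* hypothetical */
    draw_opaque_objects();             /* hypothetical */

    /* Transparent pass, back to front for correct blending, with depth
     * writes off so transparent surfaces don't occlude one another. */
    glEnable(GL_BLEND);
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
    glDepthMask(GL_FALSE);
    sort_transparent_back_to_front();  /* hypothetical */
    draw_transparent_objects();        /* hypothetical */

    glDepthMask(GL_TRUE);
}
</code>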

The Overdraw Map feature in the Mali Graphics Debugger can be enabled with a simple toggle in the UI, switching the device to a special mode that displays overdraw. Once in this mode you can capture a frame and step through each draw call to see how the overdraw builds up, and then identify the offending draw call.


Figure 1. Overdraw Map

 

Figure 1 above shows the overdraw map for a game application. The whiter an area on the map, the more overdraw there is (black being zero overdraw). This view provides a simple means of identifying areas where you could reduce overdraw.

 

Shader Utilization

The Mali Graphics Debugger has a useful feature allowing you to capture all the shaders (vertex and fragment) used in a frame, along with cycle count information for each shader. With only this information, it is difficult to identify which shaders deserve your optimization effort. Say, for example, you have a fragment shader reported as 12 cycles compared with 5-6 cycles for the other shaders. Instinct might dictate that you start optimizing the 12-cycle shader, but if that particular shader only renders 4 pixels on the screen, there is unlikely to be any performance gain from working on it. A new feature in Mali Graphics Debugger v1.2, the Shader Utilization Map, lets you identify the most expensive shaders in terms of real usage, i.e. the shaders that consume the most GPU processing time.
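In other words, the figure that matters is roughly the per-fragment cycle count multiplied by the number of fragments the shader actually shades. A toy C illustration, using invented numbers, makes the point:

<code>
#include <stdio.h>

/* Toy illustration with invented numbers: a shader's real cost is its
 * per-fragment cycle count times how many fragments it actually shades. */
typedef struct {
    const char *name;
    long cycles_per_fragment;
    long fragments_shaded;
} shader_stat;

int main(void)
{
    shader_stat shaders[] = {
        { "expensive_looking", 12,      4 },   /* tiny on screen        */
        { "cheap_looking",      6, 921600 },   /* fills a 1280x720 view */
    };
    for (int i = 0; i < 2; ++i)
        printf("%s: %ld cycles\n", shaders[i].name,
               shaders[i].cycles_per_fragment * shaders[i].fragments_shaded);
    return 0;
}
</code>

The 12-cycle shader costs 48 cycles in total, while the cheaper-looking 6-cycle shader covering the whole screen costs over five million.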


Figure 2. Shader Utilization Map

 

Figure 2 above shows the same gaming application mentioned in the overdraw example. This view lets you see clearly which shader contributed to which pixel and understand where optimization efforts should start.

 


Figure 3. Shader Statistics

 

Along with the shader statistics (Figure 3), you can see the number of instances run of each shader and its total cost. A benefit of using this view is that it also allows you to identify opportunities for batching draw calls, resulting in lower CPU overhead and fewer state changes, as sketched below.
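As a sketch of the batching idea: objects that share the same shader, texture and render state can be packed into a single vertex/index buffer pair and drawn with one call instead of many. The C sketch below uses standard OpenGL ES 2.0 calls; the packing step and parameter names are hypothetical:

<code>
#include <GLES2/gl2.h>

/* Sketch: objects sharing the same shader, texture and blend state are
 * packed at load time into one VBO/IBO pair, so a single
 * glDrawElements() replaces many small draws. Parameter names are
 * hypothetical. */
void draw_batched(GLuint batched_vbo, GLuint batched_ibo,
                  GLsizei total_index_count)
{
    glBindBuffer(GL_ARRAY_BUFFER, batched_vbo);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, batched_ibo);
    /* ... glVertexAttribPointer() setup shared by every object ... */

    /* One call replaces N, cutting CPU driver overhead and the state
     * changes that would otherwise sit between the individual draws. */
    glDrawElements(GL_TRIANGLES, total_index_count, GL_UNSIGNED_SHORT, 0);
}
</code>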

 

Mali Graphics Debugger v1.2

The above features are all part of the latest release of the Mali Graphics Debugger, available on the Mali Developer Centre site. We have also added a number of other features to this release, including:

  • Support for visualizing ASTC compressed textures
  • Vertex Shader utilization
  • Frame buffer thumbnails
  • Many other improvements and fixes.

 

Head over to the Mali Developer Centre now to download your free copy. Please leave your feedback below.


HTML5, The Art of Illusion


In a previous life, I was given the challenge of building a responsive, expandable UI framework that would run at high performance on multiple devices, from desktops, tablets and phones all the way to Set Top Boxes and Smart TVs.  I was thinking "Wow - what a great challenge!"; then I was told that it had to be implemented in HTML5, and my heart sank.  At the time I had no HTML development experience, and I had also heard how large companies such as Facebook were moving away from HTML because they were unable to get the performance required for a good user experience.

 

But the engineer in me was curious.  All the target platforms were capable; they had good ARM® Cortex® CPUs and ARM® Mali™ GPUs: there must be a way to get the desired performance.  The fact that Mark Zuckerberg had stated that the "Biggest mistake we took as a company was betting too much on HTML5 as opposed to native ... We were never able to get the quality we wanted" was like a red rag to a bull; I had to try to prove him wrong!  If, however, you delve a little further into what Mark was saying, HTML5 was not, at the time, the right choice for targeting mobile platforms.  The variance in spec conformance was large, rendering pipelines were not well optimized, and the idiosyncrasies of developing for an embedded platform were not yet well understood.

 

HTML5 was new to me.  I had done a little work with Flash® Lite® 4, but the only takeaway from that experience was that I never wanted to work with ActionScript®, an ECMAScript-based language, again!  That is probably unfair to Adobe, who had traditionally done a great job with Flash, but from a developer's perspective the version of ActionScript that came with Flash Lite 4 could never work out which API it wanted to conform to.  So the realization that JavaScript (JS) was also ECMAScript-based, but without some of the good parts of ActionScript such as object-oriented (OO) constructs like inheritance, was not a good start.  Understanding the concept of separating content and layout through HTML and CSS was also new to me.  As you can see, there was a lot to learn!

 

But here lies a strength.  If you approach a problem with very little knowledge, you are not bringing any previous prejudices with you.  You are facing the problem with fresh eyes.  This is me trying to give that great big cloud a silver lining!

 

So, development started... slowly... and I soon stumbled across the concept of the polyfill.  It turns out that, due to the incredibly flexible nature of JS, you are able to add functionality that previously did not exist in either the JS or HTML specification.  I was able to recreate some of the benefits of OO-based inheritance through the use of polyfills.  An example of writing polyfills can be found here.

 

The next thing I discovered was that there is a great community of HTML5 developers out there, and a huge number of blog posts and tutorials to help the budding HTML5 app developer on their way.  The greatest discovery of all was how unbelievably fabulous the Chrome Developer Tools are!  I'm not going to go into much detail about them here because there is already a huge amount of material online to get you started.  If you have never used them before, I would suggest starting here, checking out some of the tutorials at http://www.html5rocks.com and following the hashtag #perfmatters.

 

So, now I had the power tools for development and debugging, polyfills to smear over the gaps in the HTML and JS APIs, and a great set of libraries to help me build complex HTML5 applications (my choices here were http://requirejs.org/, http://backbonejs.org/, http://underscorejs.org/ and the ubiquitous http://jquery.com/).  I was finally ready to start development.

 

The first thing I noticed was that, even though things were looking great on the desktop platforms, performance on the target embedded devices was shocking!  Doom and gloom!  Maybe Mark was correct and HTML5 was not yet ready for prime time?  But this was where the developer tools came to the rescue.  A quick analysis of the application while running animations, using the Chrome DevTools Timeline feature, showed that a huge amount of time was being spent painting the contents of the page.

[Image: Chrome DevTools Timeline capture showing large areas of the screen being repainted]

 

I found this strange because I was only moving assets across the screen, not updating their contents.  In the image above, you can clearly see large areas of the screen being repainted.  Surely this should be a simple composition job?  Why was the whole screen being re-rendered on every frame?

 

It turns out that you have to be more expressive in your CSS to help the browser know which elements are going to be moved.  This can be done by promoting the DOM (Document Object Model) element into its own layer.  A good description of this method can be found here, but the basic premise is that you apply a null 3D transform to the DIV (Document Division Element), for example:

 

<code>
.animate {
  transform: translate3d(0, 0, 0);
}
</code>

 

This will cause a new layer to be created for the DIV to be rendered into.  If you then want to change the position of that item, you can update that same 3D translation to move it around the screen.  Because the DIV is now represented in its own layer, which in turn is backed by a GPU texture, there is no need to repaint any part of the DOM when repositioning it, and the animation boils down to a simple GPU-backed composition job.  Now we are working to the GPU's strengths!

 

Great!  Problem solved!! Well, almost...

 

The use case being implemented in this example is a scrolling list.  One characteristic of a scrolling list is that items are added to the front and removed from the end of the list.  There are two ways in which I could have implemented it:

 

  1. Have all list items available in the DOM when the screen loaded
  2. Dynamically add and remove items as the list scrolls

 

The problem with #1 is that there is no way of knowing how many items there will be in the list.  I could be representing a TV channel lineup, which could easily contain 800 items.  If each of these items has a channel logo, and perhaps an image representing what is currently showing on the channel, the memory requirements of the list will be huge!  The likelihood is that the application would either abort with a failure or become unresponsive.

 

So #2 was chosen, and a framework was built to manage the adding and removing of items.  But there are problems with this approach as well.  Every time we change the DOM we get a 'judder' in the UI as the browser recalculates layout, downloads and decodes images for new assets, and renders the contents into a new layer.  So every time we move the list, we are triggering major layout and repaint events.

 

 

Bring on the Magician

 

So, the title of this blog talks about illusion....  This is where we need a magician, and a good one at that!

 

The solution to this particular problem is to understand the difference between 'interaction time' and 'browser time'.  I'm not sure whether these are industry-standard names for what I am about to describe, but they are how I have come to refer to the issue.

 

Interaction time is the time when I am interacting with the UI.  This is also the time at which my perception of application framerate is at its highest.  Imagine I am using a touch device and I am sliding my finger across the screen.  If the item I am moving does not track my finger immediately and without lag, I will notice and the experience will be poorer as a result.

 

Browser time is the time in-between interaction time.  During this time, the framerate could fluctuate drastically and I wouldn't notice.  There would be no impact on my experience during this time.

 

The idea is to defer into browser time any operations that would impact the framerate of the application.  By doing this, you disguise the inefficient paths of the browser and make much better use of the right resources at the moments they are needed.

 

 

This blog post does not cover the whole story; I will be posting more in the weeks to come.  The bulk of this post was covered in a presentation I gave at the Developer Summit at ARM TechCon earlier this year.



I will also be giving a version of the presentation at DevCon5 in California on the 10th December; it would be great to see you there.

 

 

Conclusion


So, did I prove Mark wrong?  Well, yes and no.

 

It turns out that efficient HTML5 development is hard, but it's not impossible.  As long as you are willing to put in the effort to understand what is happening under the hood, you stand a much better chance of crafting an application that performs well on your target devices.  The great news is that the ARM Mali Ecosystem team is working hard with a number of browser vendors to make sure that future browser engines perform well on ARM Mali GPUs.

 

 

I found the experience challenging but ultimately rewarding, and I would encourage you to get involved in the HTML5 community and build beautiful bodacious web apps for our ARM based devices.

[VIDEO] Learn how to create 2D and 3D games using the Unity platform


At the ARM® TechCon Developer Summit, the ARM Mali team was joined on the Tuesday (Gaming Day) by Carl Callewaert of Unity3D. Carl gave a great demonstration of how easy it is to build a simple 2D game on the Unity platform:

 

 

Unity and ARM have worked together to implement optimizations that improve graphical performance on ARM Mali GPUs and make that performance accessible to developers using Unity. With the large catalogue of Unity-authored titles, Mali-based devices now enjoy a huge selection of high-quality games with console-quality visuals.

 

If you're interested in this video and want to learn more, you should also check out this presentation written by our developer education team: Optimizing Unity Games for Mobile Platforms.

[VIDEO] Introducing Play Games


Watch this video to discover Google's game platform for Android™, iOS and the web: social, cross-platform and robust.  Daniel Galpin of Google introduced Play Games at the ARM® TechCon™ Developer Summit in October and explained to the audience how to get the most out of it.

 

[VIDEO] Heterogeneous Performance with OpenCL and MxPA


At ARM® TechCon Developer Summit we were joined on the Tuesday by John Stratton, Senior Architect at MulticoreWare:

 

 

With OpenCL support on ARM Mali GPU devices, there are new opportunities for energy-efficient kernels in many application areas.  However, getting the best performance can be a challenge, because the simplest programming practices do not always fully utilize the GPU's capabilities.  On top of that, for many, adopting OpenCL means developing and supporting multiple implementations of important kernels for both GPU and CPU.  The Multicore cross-Platform Architecture (MxPA) from MulticoreWare addresses both of these issues with an OpenCL kernel optimizer and a full OpenCL implementation for ARM CPUs that can be bundled and shipped with an application executable.

 

For more information on how best to work with GPU Compute, you can also check out this document from TechCon: GPU Compute for Mobile Devices.

[VIDEO] The Marmalade™ SDK


Imagine if you could develop once and then deploy to any ARM®-powered device.

 

Imagine doing this without a web application, without running an expensive virtual machine or runtime, and whilst taking full advantage of the ARM architecture.

 

Enter the Marmalade SDK, a powerful, cross-platform tool that enables developers to deploy across multiple devices from a single code base.

 

In the video below Donald Beaston, COO of Marmalade, gives an under-the-hood view of Marmalade technology, how it leverages the ARM® architecture and how it can benefit developers.

 
