
Mali Graphics Performance #2: How to Correctly Handle Framebuffers


This week I take a slight diversion from the hardware-centric view of the rendering pipeline we have been exploring so far to look at how, and more importantly when, the Mali driver stack turns OpenGL ES API activity into the hardware workloads needed for rendering. As we will see, OpenGL ES is not particularly tightly specified around this area, so there are some common pitfalls which developers must be careful to avoid.

 

Per-Render Target Rendering: Quick Recap

 

As described in my previous blogs, Mali's hardware engine operates on a two-pass rendering model, rendering all of the geometry for a render target to completion before starting any of the fragment processing. This allows us to keep most of our working state in local memory tightly coupled to the GPU, and minimize the amount of power-hungry external DRAM accesses which are needed for the rendering process.

 

When OpenGL ES is used well we can create, use, and then discard most of our framebuffer data inside this local memory. This avoids the need to read framebuffers from, or write framebuffers to, external memory at all, except for the buffers we want to keep such as the color buffer. However this isn't guaranteed behavior and some patterns of API usage can trigger inefficient behavior which forces the GPU to make extra reads and writes.

 

OpenGL ES: What is a Render Target?

 

In OpenGL ES there are two types of render target:

  • On-screen window render targets
  • Off-screen framebuffer render targets

 

Conceptually these are very similar in OpenGL ES, although not entirely identical. Only one render target can be active at the API level for rendering at any point in time; the current render target is selected via a call to glBindFramebuffer( GL_FRAMEBUFFER, fbo_id ), where an ID of 0 can be used to switch back to the window render target (also often called the default FBO).

 

On-screen Render Targets

 

On-screen render targets are tightly defined by EGL. The demarcation between one frame and the next is very clearly defined: all rendering to FBO 0 between two calls to eglSwapBuffers() defines the rendering for one frame.

 

In addition the color, depth, and stencil buffers in use are defined when the context is created, and their configuration is immutable. By default the contents of the color, depth, and stencil buffers immediately after eglSwapBuffers() are undefined - the old values are not preserved from the previous frame - allowing the GPU driver to make guaranteed assumptions about the use of the buffers. In particular we know that depth and stencil are only transient working data, and we never need to write them back to memory.

 

Off-screen Render Targets

 

Off-screen render targets are less tightly defined.

 

Firstly, there is no equivalent of eglSwapBuffers() which tells the driver that the application has finished rendering to an FBO and that it can be submitted for rendering; the flush of the rendering work is inferred from other API activity. We'll look in more detail at the inferences the Mali drivers support in the next section.

 

Secondly, there are no guarantees about what the application will do with the buffers attached to the color, depth, and stencil attachment points. An application may use any of these as textures, or reattach them to a different FBO, for example reloading the depth values from a previous frame as the starting depth values for a new frame. By default the behavior of OpenGL ES is to preserve all attachments, unless they are explicitly discarded by the application via a call to glInvalidateFramebuffer(). Note: this is a new entry point in OpenGL ES 3.0; in OpenGL ES 2.0 you can access the equivalent functionality via the glDiscardFramebufferEXT() extension entry point, which all Mali drivers support.
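
As a minimal sketch (assuming a build-time USE_GLES3 flag of your own), the two entry points take the same arguments when an FBO is bound:

static const GLenum attachments[2] = { GL_DEPTH_ATTACHMENT, GL_STENCIL_ATTACHMENT };

#if defined(USE_GLES3)
    glInvalidateFramebuffer( GL_FRAMEBUFFER, 2, attachments );  // OpenGL ES 3.0 core
#else
    glDiscardFramebufferEXT( GL_FRAMEBUFFER, 2, attachments );  // EXT_discard_framebuffer
#endif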

 

Render Target Flush Inference

 

In normal circumstances Mali flushes rendering work when a render target is "unbound", except for the main window surface which is flushed when the driver sees a call to eglSwapBuffers().

 

To avoid a performance drop, developers need to avoid unneeded flushes which contain only a subset of the final rendering, so it is recommended that you bind each off-screen FBO once per frame and render it to completion in one go.

 

A well-structured rendering sequence (almost anyway - read the next section to see why this is incomplete) would look like:

 

#define ALL_BUFFERS (GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT)

glClear( ALL_BUFFERS )                  // Clear initial state
glDraw...( ... )                        // Draw something to FBO 0 (window surface)

glBindFramebuffer( GL_FRAMEBUFFER, 1 )  // Switch away from FBO 0, does not trigger rendering
glClear( ALL_BUFFERS )                  // Clear initial state
glDraw...( ... )                        // Draw something to FBO 1
...                                     // Draw FBO 1 to completion

glBindFramebuffer( GL_FRAMEBUFFER, 0 )  // Switch to FBO 0, unbind and flush FBO 1 for rendering
glDraw...( ... )                        // Draw something else to FBO 0 (window surface)

glBindFramebuffer( GL_FRAMEBUFFER, 2 )  // Switch away from FBO 0, does not trigger rendering
glClear( ALL_BUFFERS )                  // Clear initial state
glDraw...( ... )                        // Draw something to FBO 2
...                                     // Draw FBO 2 to completion

glBindFramebuffer( GL_FRAMEBUFFER, 0 )  // Switch to FBO 0, unbind and flush FBO 2 for rendering
glDraw...( ... )                        // Draw something else to FBO 0 (window surface)
eglSwapBuffers()                        // Tell EGL we have finished, flush FBO 0 for rendering

 

By contrast, the "bad" behavior would look like:

 

#define ALL_BUFFERS (GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT)

glClear( ALL_BUFFERS )                  // Clear initial state
glDraw...( ... )                        // Draw something to FBO 0 (window surface)

glBindFramebuffer( GL_FRAMEBUFFER, 1 )  // Switch away from FBO 0, does not trigger rendering
glClear( ALL_BUFFERS )                  // Clear initial state
glDraw...( ... )                        // Draw something to FBO 1

glBindFramebuffer( GL_FRAMEBUFFER, 0 )  // Switch to FBO 0, unbind and flush FBO 1 for rendering
glDraw...( ... )                        // Draw something else to FBO 0 (window surface)

glBindFramebuffer( GL_FRAMEBUFFER, 1 )  // Rebind FBO 1, does not trigger rendering
                                        // However, rebinding FBO 1 requires us to reload its old
                                        // render state from memory, and write over the top of it
glDraw...( ... )                        // Draw something to FBO 1

glBindFramebuffer( GL_FRAMEBUFFER, 0 )  // Switch to FBO 0, unbind and flush FBO 1 (again)
glDraw...( ... )                        // Draw something else to FBO 0 (window surface)
eglSwapBuffers()                        // Tell EGL we have finished, flush FBO 0 for rendering

 

This pattern of behavior is known as an incremental render, and it forces the driver to process the render target twice: the first processing pass must write all of the intermediate render state out to memory (color, depth, and stencil), and the second pass must read it back in from memory again so it can "append" more rendering on top of the old state.

 

[Diagram: framebuffer bandwidth of an incremental render compared with a single-pass render]

 

As the diagram above shows, incremental rendering carries a +400% framebuffer bandwidth penalty (assuming 32-bpp color and D24S8 packed depth-stencil) when compared against a well-structured single-pass render, which avoids the need to write and then re-read the intermediate state to and from main memory.
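
As a rough sketch of where that headline figure comes from, assuming the driver has to preserve all three buffers across the split:

  • Single-pass render: write the final color once = 4 bytes per pixel.
  • Incremental render: write color plus packed depth-stencil at the first flush (4 + 4 = 8 bytes), read it all back for the second pass (8 bytes), then write the final color (4 bytes) = 20 bytes per pixel.

That is 20 / 4 = 5x the framebuffer traffic, i.e. a +400% penalty.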

 

When to call glClear?

 

The observant reader will have noted that I inserted some calls to glClear() into the rendering sequence for our framebuffers. The application should call glClear() for every render target at the start of that render target's rendering sequence, provided that the previous contents of the attachments are not needed, of course. This explicitly tells the driver we do not need the previous state for this frame of rendering, so we avoid reading it back from memory, and it also puts any undefined buffer contents into a defined "clear color" state.

 

One common mistake seen here is clearing only part of the framebuffer, i.e. calling glClear() while a scissor rectangle with only partial screen coverage is active. We can only completely drop the old render state when the clear applies to the whole surface, so a full-surface clear should be performed where possible.
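
A minimal sketch of a safe full-surface clear; note that the scissor rectangle and the write masks both apply to glClear(), so they must not mask off part of the clear:

glDisable( GL_SCISSOR_TEST )                        // Scissor also applies to clears
glColorMask( GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE )   // ... as do the write masks
glDepthMask( GL_TRUE )
glStencilMask( 0xFF )
glClear( GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT )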

 

When to call glInvalidateFramebuffer?

 

The final requirement placed on the application for efficient use of FBOs in the OpenGL ES API is that it should tell the driver which of the color / depth / stencil attachments are simply transient working buffers whose values can be discarded at the end of the current render pass. For example, nearly every 3D render will use color and depth, but for most applications the depth buffer is transient and can be safely invalidated. Failure to invalidate unneeded buffers may result in them being written back to memory, wasting memory bandwidth and increasing the energy consumption of the rendering process.

 

The most common mistake at this point is to treat glInvalidateFramebuffer() as equivalent to glClear() and place the invalidate call for frame N state at the first use of that FBO in frame N+1. This is too late! The purpose of the invalidate call is to tell the driver that the buffers do not need to be kept, so we need to modify the work submission to the GPU for the frame which produces those buffers. Telling us in the next frame is often after the original frame has been processed. The application needs to ensure that the driver knows which buffers are transient before the framebuffer is flushed; therefore transient buffers in frame N should be indicated by calling glInvalidateFramebuffer() before unbinding the FBO in frame N. For example:

 

#define ALL_BUFFERS (GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT)
static const GLenum invalid_ap[2] = { GL_DEPTH_ATTACHMENT, GL_STENCIL_ATTACHMENT };

glClear( ALL_BUFFERS )                  // Clear initial state
glDraw...( ... )                        // Draw something to FBO 0 (window surface)

glBindFramebuffer( GL_FRAMEBUFFER, 1 )  // Switch away from FBO 0, does not trigger rendering
glClear( ALL_BUFFERS )                  // Clear initial state
glDraw...( ... )                        // Draw something to FBO 1
...                                     // Draw FBO 1 to completion
glInvalidateFramebuffer( GL_FRAMEBUFFER, 2, &invalid_ap[0] )  // Only keep color

glBindFramebuffer( GL_FRAMEBUFFER, 0 )  // Switch to FBO 0, unbind and flush FBO 1 for rendering
glDraw...( ... )                        // Draw something else to FBO 0 (window surface)

glBindFramebuffer( GL_FRAMEBUFFER, 2 )  // Switch away from FBO 0, does not trigger rendering
glClear( ALL_BUFFERS )                  // Clear initial state
glDraw...( ... )                        // Draw something to FBO 2
...                                     // Draw FBO 2 to completion
glInvalidateFramebuffer( GL_FRAMEBUFFER, 2, &invalid_ap[0] )  // Only keep color

glBindFramebuffer( GL_FRAMEBUFFER, 0 )  // Switch to FBO 0, unbind and flush FBO 2 for rendering
glDraw...( ... )                        // Draw something else to FBO 0 (window surface)

eglSwapBuffers()                        // Tell EGL we have finished, flush FBO 0 for rendering

 

Summary

 

In this blog we've looked at how the Mali drivers [1] handle the identification of render passes, the common points of inefficiency, and how an application developer can drive the OpenGL ES API to avoid them. In summary we recommend:

  • Binding each FBO (other than FBO 0) exactly once in each frame, rendering it to completion in a contiguous sequence of API calls.
  • Calling glClear() at the start of each FBO’s rendering sequence, for all attachments where the old value is not needed.
  • Calling glInvalidateFramebuffer() or glDiscardFramebufferExt() at the end of each FBO’s rendering sequence, before switching to a different FBO, for all attachments which are simply transient working buffers for the intermediate state.

 

Next time I'll look at a related topic to this one – the efficient use of EGL_BUFFER_PRESERVE for maintaining window surface color from one frame as the default input for the next frame, and the implications that has for performance and bandwidth.

 

Cheers,

Pete

 

Footnotes

  1. It is worth noting that little of this is actually Mali specific - most of the mobile GPU vendors make the same recommendations, so this is general best practice, irrespective of the underlying GPU.

 


Pete Harris is the lead performance engineer for the Mali OpenGL ES driver team at ARM. He enjoys spending his time working on a whiteboard and determining how to get the best out of combined hardware and software compute sub-systems. He spends his working days thinking about how to make the ARM Mali drivers even better.


Inside the Demo: GPU Particle Systems with ASTC 3D textures


At GDC 2014 we demonstrated the benefits of the OpenGL® ES 3.0 API and the newly introduced OpenGL ES 3.1 extensions. The Adaptive Scalable Texture Compression (ASTC) format is one of the biggest introductions to the OpenGL ES API. The demo I’m going to talk about is a case study of the usage of 3D textures in the mobile space and how ASTC can compress them to provide a huge memory reduction. 3D textures weren’t available in the core OpenGL ES spec up to version 2.0, and the workaround was to use hardware-dependent extensions or 2D texture arrays. Now with OpenGL ES 3.x, 3D textures are embedded in the core specification and ready to use... if only they were not so big! Uncompressed 3D textures cost a huge amount of memory (for example, a 256x256x256 texture in RGBA8888 format uses circa 68MB), which cannot be afforded on a mobile device.

 

Why did we use ASTC?

The same texture can instead be compressed using different levels of compression with ASTC, giving a saving of ~80% even when using the highest quality settings. For those unfamiliar with the ASTC texture compression format, it is a block-based compression algorithm where LxM (or LxMxN in the case of 3D textures) blocks of texels are compressed together into a single 128-bit block. The L, M, N values are one of the compression quality factors and represent the number of texels per block dimension. For 3D textures, the allowed dimensions vary from 3 to 6, as reported in the table below:

 

Block Dimension    Bit Rate (bits per texel)
3x3x3              4.74
4x3x3              3.56
4x4x3              2.67
4x4x4              2.00
5x4x4              1.60
5x5x4              1.28
5x5x5              1.02
6x5x5              0.85
6x6x5              0.71
6x6x6              0.59

 

Since the compressed block size is always 128 bits for all block dimensions, the bit rate is simply 128 / #texels_in_a_block. One of the features of ASTC is that it can also compress HDR values (typically 16 bits per channel). Since we needed to store high-precision floating-point values in the textures in the demo, we converted the float values (32 bits per channel) to half-float format (16 bits per channel) and used ASTC to compress those textures. In this way the loss of precision is smaller compared to the usual 32-bit to 8-bit conversion and compression. It is worth noting that using the HDR formats doesn’t increase the size of the compressed texture, because each compressed block still uses 128 bits. Below you can see a 3D texture rendered simply using slicing planes. The compression formats used are (from left to right): uncompressed, ASTC 3x3x3, ASTC 4x4x4, ASTC 5x5x5.
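
As a quick illustration of that arithmetic, a small helper along these lines reproduces the bit rates in the table above:

float astc_bits_per_texel( int l, int m, int n )
{
    // Every ASTC block compresses to exactly 128 bits, whatever its dimensions,
    // so e.g. 3x3x3 -> 128/27 = ~4.74 and 4x4x4 -> 128/64 = 2.0 bits per texel.
    return 128.0f / (float)( l * m * n );
}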

 

 

For those interested in the details of the algorithm, an open source ASTC evaluation encoder/decoder is available at http://malideveloper.arm.com/develop-for-mali/tools/astc-evaluation-codec/ and a video of an internal demo ported to ASTC is available at https://www.youtube.com/watch?v=jEv-UvNYRpk.

 

Demo Overview

The main objective of the demo was to use the new OpenGL ES 3.0 API to realize realistic particle systems where motion physics as well as collisions are managed entirely on the GPU. The demo shows two scenes, one which simulates confetti, the other smoke.

 

    

 

Transform Feedback for physics simulation

The first feature I want to talk about, which is used for the physics simulation, is Transform Feedback. Physics simulation steps typically output a set of buffers, using the previous step's results as inputs. These kinds of algorithms, called explicit methods in numerical analysis, are well suited to Transform Feedback because it allows the results of vertex shader execution to be written back into a buffer that can subsequently be mapped for CPU reads or used as the input buffer for other shaders. In the demo, each particle is mapped to a vertex and the input parameters (position, velocity and lifetime) are stored in an input vertex buffer, while the outputs are bound to the transform feedback buffer (a setup sketch follows the table below).

Because the whole physics simulation runs on the GPU, we needed a way to give each particle knowledge of the objects in the scene (this is now less problematic using Compute Shaders; see below for details). 3D textures helped us here because they can represent volumetric information and can easily be sampled in the vertex shader like a classic texture. The 3D textures are generated from the 3D meshes of various objects using a free tool called Voxelizer (http://techhouse.brown.edu/~dmorris/voxelizer/). The voxel data contains the normal of the surface for voxels on the mesh surface, or the direction of and distance to the nearest point on the surface for voxels inside the object. 3D textures can also be used to represent various other types of data, such as a simple mask for occupied or free areas in a scene, density maps or 3D noise.

When uploading the files generated by Voxelizer, we convert the floating point values to half-float and then compress the 3D texture using ASTC HDR. In the demo, we use different compression block dimensions to show the differences between uncompressed and compressed textures: memory size, memory read bandwidth and energy consumption per frame. The smallest block size (3x3x3) gives us a ~90% reduction and our biggest texture goes down from ~87MB to ~7MB. Below you can find a table of bandwidth measurements for the various models we used, on a Samsung Galaxy Note 10.1 (2014 Edition).

 

                      Sphere        Skull         Calice        Rock         Hand
Texture Resolution    128x128x128   180x255x255   255x181x243   78x75x127    43x97x127

Texture Size (MB)
Uncompressed          16.78         82.62         89.73         5.94         4.24
ASTC 3x3x3            1.27          6.12          6.72          0.45         0.34
ASTC 4x4x4            0.52          2.63          2.87          0.19         0.14
ASTC 5x5x5            0.28          1.32          1.48          0.10         0.07

Memory Read Bandwidth (MB/s)
Uncompressed          644.47        752.18        721.96        511.48       299.36
ASTC 3x3x3            342.01        285.78        206.39        374.19       228.05
ASTC 4x4x4            327.63        179.43        175.21        368.13       224.26
ASTC 5x5x5            323.10        167.90        162.89        366.18       222.76

Energy Consumption per Frame, DDR2 (mJ)
Uncompressed          4.35          5.08          4.87          3.45         2.01
ASTC 3x3x3            2.31          1.93          1.39          2.53         1.54
ASTC 4x4x4            2.21          1.21          1.18          2.48         1.51
ASTC 5x5x5            2.18          1.13          1.10          2.47         1.50

Energy Consumption per Frame, DDR3 (mJ)
Uncompressed          3.58          4.17          4.01          2.84         1.66
ASTC 3x3x3            1.90          1.59          1.15          2.08         1.27
ASTC 4x4x4            1.82          1.00          0.97          2.04         1.24
ASTC 5x5x5            1.79          0.93          0.90          2.03         1.24
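
As promised above, here is a minimal transform feedback setup sketch. The buffer and varying names are illustrative rather than the demo's actual code; the structure follows the standard OpenGL ES 3.0 pattern of capturing vertex shader outputs into a buffer with rasterization disabled:

const char *varyings[] = { "out_position", "out_velocity", "out_lifetime" };
glTransformFeedbackVaryings( program, 3, varyings, GL_INTERLEAVED_ATTRIBS );
glLinkProgram( program );                     // Captured varyings take effect at link time

glUseProgram( program );
glBindBufferBase( GL_TRANSFORM_FEEDBACK_BUFFER, 0, tf_buffer );
glEnable( GL_RASTERIZER_DISCARD );            // Simulation step only, no fragments needed

glBeginTransformFeedback( GL_POINTS );
glDrawArrays( GL_POINTS, 0, num_particles );  // One vertex per particle
glEndTransformFeedback();

glDisable( GL_RASTERIZER_DISCARD );

// The captured buffer can now be mapped for CPU reads, bound as the input
// vertex buffer for the next simulation step, or used by the rendering pass.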

 

Instancing for efficiency

Another feature introduced in OpenGL ES 3.0 is Instancing. It permits us to specify geometry only once and reuse it multiple times in different locations with a single draw call. In the demo we use it for the confetti rendering where, instead of defining a vertex buffer of 2500*4 vertices (we render 2500 particles as quads in the confetti scene), we just define a vertex buffer of 4 vertices and call:

 

glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, 4, 2500 );

 

where GL_TRIANGLE_STRIP specifies the type of primitive to render, 0 is the start index inside the enabled vertex buffers, 4 specifies the number of vertices needed to render one instance of the geometry (4 vertices per quad) and 2500 is the number of instances to render. Inside the vertex shader, the gl_InstanceID built-in variable will be available and will contain the identifier of the current instance. This variable can, for example, be used to access an array of matrices or do specific calculations for each instance. A divisor can also be specified for each instanced vertex attribute (via glVertexAttribDivisor), which controls how the vertex shader advances through that attribute's vertex buffer for each instance.
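
As an illustrative sketch (not the demo's actual shader), a per-instance attribute with a divisor of 1 might be combined with the shared quad vertices like this:

// C side: advance attribute 1 once per instance rather than once per vertex
glVertexAttribDivisor( 1, 1 );

// Vertex shader:
#version 300 es
layout(location = 0) in vec2 quad_corner;    // The 4 shared quad vertices
layout(location = 1) in vec3 particle_pos;   // Per-instance data, divisor = 1
uniform mat4 viewProj;

void main(void)
{
    // gl_InstanceID could also be used here, e.g. to index a color table.
    vec3 world = particle_pos + vec3(quad_corner * 0.01, 0.0);
    gl_Position = viewProj * vec4(world, 1.0);
}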


The smoke scene

In the smoke scene, the smoke is rendered using a noise texture and some math to compute the final colour as if it were a 3D volume. To give the smoke a transparent look we need to combine the colours of different overlapping particles. To do so we use additive blending and disable the z-test when rendering the particles. This gives a nice result even without sorting the particles by z-value (which would otherwise require mapping the buffer on the CPU). Another reason for disabling the z-test is to realize soft particles. The Mali-T6xx series of GPUs can use a specific extension in the fragment shader to read back the values of the framebuffer (colour, depth and stencil) without having to render-to-texture. This feature makes it easier to realize soft particles, and in the demo we use a simple approach. First, we render all the solid objects so that their z-values are written to the depth buffer. Then, when rendering the smoke, thanks to the Mali extension we can read the depth value of the object, compare it with the current fragment of the particle (to see if it is behind the object) and fade the colour accordingly. This technique eliminates the sharp profile that would otherwise be formed by the particle quad intersecting the geometry due to the z-test (another reason we had to disable it).
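
The extension in question is ARM_shader_framebuffer_fetch (with a depth/stencil companion extension). A hedged sketch of the fade, with illustrative uniform and varying names, might look like:

#extension GL_ARM_shader_framebuffer_fetch_depth_stencil : require
precision mediump float;

uniform float u_fadeScale;     // Illustrative: controls how quickly particles fade
varying vec4 v_smokeColor;

void main(void)
{
    // Window-space depth written by the solid geometry, read in-tile
    // without a render-to-texture pass.
    float sceneDepth = gl_LastFragDepthARM;

    // Fade the particle as its own depth approaches the solid surface.
    float fade = clamp( ( sceneDepth - gl_FragCoord.z ) * u_fadeScale, 0.0, 1.0 );

    gl_FragColor = v_smokeColor * fade;    // Additive blending is enabled
}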

 

Blurring the smoke

During development the smoke effect looked nice but we wanted it to be denser and blurrier. To achieve this we decided to render the smoke into an off-screen render buffer with a lower resolution than the main screen. This gives us blurred smoke (since the lower resolution removes the higher frequencies) and lets us increase the number of particles to get a denser look. The current implementation uses a 640x360 off-screen buffer that is up-scaled to 1080p resolution in the final image. A naïve approach causes jaggies on the outline of the object when the smoke is flowing near it, due to the blending of the up-sampled low resolution buffer. To almost eliminate this effect, we apply a bilateral filter. The bilateral filter is applied to the off-screen buffer and is given by the product of a Gaussian filter on the colour texture and a weighting factor given by the difference in depth. The depth factor is useful on the edge of the model because it gives a higher weight to neighbouring texels with a depth similar to that of the current pixel, and a lower weight when this difference is larger (if we consider a pixel on the edge of a model, some of the neighbouring pixels will still be on the model while others will be far in the background).
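
A sketch of that weighting (sampler names and constants are illustrative): a spatial Gaussian term multiplied by a depth-difference term, accumulated over a small neighbourhood of the low-resolution buffer:

precision mediump float;

uniform sampler2D u_smokeColor;    // Low-resolution smoke colour
uniform sampler2D u_smokeDepth;    // Matching depth values
uniform vec2 u_texelSize;
varying vec2 v_uv;

void main(void)
{
    float centerDepth = texture2D( u_smokeDepth, v_uv ).r;
    vec4 colorSum = vec4( 0.0 );
    float weightSum = 0.0;

    for (int y = -2; y <= 2; y++)
    {
        for (int x = -2; x <= 2; x++)
        {
            vec2 offset = vec2( float( x ), float( y ) ) * u_texelSize;
            float depth = texture2D( u_smokeDepth, v_uv + offset ).r;

            // Spatial Gaussian weight...
            float spatial = exp( -float( x * x + y * y ) / 8.0 );
            // ...scaled down as the depth difference to the centre grows.
            float range = 1.0 / ( 1.0 + 200.0 * abs( depth - centerDepth ) );

            float w = spatial * range;
            colorSum  += texture2D( u_smokeColor, v_uv + offset ) * w;
            weightSum += w;
        }
    }

    gl_FragColor = colorSum / weightSum;
}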

 

 

Bonus track

The recently released OpenGL ES 3.1 spec introduced Compute Shaders as a method for general computing on the GPU (a sort of subset of OpenCL™, but in the same context as OpenGL, so no context switching is needed!). You can see it in action below:


 

An introduction to Compute Shaders is also available at:

Get started with compute shaders

 

 

References:

I would like to point out some useful websites that helped me understand Instancing and Transform Feedback:

Transform Feedback:

https://www.opengl.org/wiki/Transform_Feedback
http://prideout.net/blog/?tag=opengl-transform-feedback
http://ogldev.atspace.co.uk/www/tutorial28/tutorial28.html
http://open.gl/feedback

Instancing:

http://www.opengl-tutorial.org/intermediate-tutorials/billboards-particles/particles-instancing/
http://ogldev.atspace.co.uk/www/tutorial33/tutorial33.html
https://www.opengl.org/wiki/Vertex_Rendering#Instancing

ASTC Evaluation Codec:
http://malideveloper.arm.com/develop-for-mali/tools/astc-evaluation-codec/

Voxelizer:
http://techhouse.brown.edu/~dmorris/voxelizer/

Soft-particles:
http://blog.wolfire.com/2010/04/Soft-Particles

Mali Weekly Round-Up


Chukong Technologies and ARM raise the bar for mobile graphics performance

 

International mobile entertainment platform company Chukong Technologies, which powers seven out of the ten top grossing games in China, and ARM announced a partnership to optimize games built in Cocos2d-x for ARM®-based devices. Cocos2d-x is an open-source, cross-platform game engine developed and maintained by Chukong Technologies, which has been downloaded more than 1.1 million times by developers such as Wooga, Zynga, Gamevil and more.

 

For more information, read the full PR here.

 

Or if you want to learn more about Cocos2d-x, its co-founder gave an interview with ARM at GDC 2014:

 

 

 

ARM expects ~1bn entry level smartphones in 2018, $20 smartphones coming this year

 

One of the most picked-up stories this week was James Bruce, ARM's Director of Mobile Solutions, discussing the entry-level smartphone market, pointing to the sub-$46 Coolpad 7321 with its ARM Mali™-400 GPU as a case in point for smartphone affordability, as well as dropping hints of an imminent $20 smartphone.
For more information, read full articles here and here.

 

The ARM chips behind the wearable revolution

 

Hexus describes how the combination of an ARM Cortex® CPU and an ARM Mali GPU within a smartwatch is a potent one, offering both efficiency and the great user interface that consumers are becoming used to.


For more information, read the full article here.

 

Huawei Ascend P7 Hands-on Review


This week saw many reviewers get their hands on the latest release from Huawei, the Ascend P7 with its HiSilicon KIRIN 910T processor featuring a Mali-450 GPU.

 

For more information, here are some reviews:

AnandTech | Huawei Launches Ascend P7 Based on Custom HiSilicon SoC

Huawei Ascend P7 hands-on review | ITProPortal.com

http://www.gizmodo.co.uk/2014/05/huawei-ascend-p7-hands-on-a-smartphone-for-generation-selfie/

 

Other Mali-based devices released this week include...

 

The Oppo Joy smartphone, Oppo's 3G-enabled budget offering, launched in the Indian market at Rs. 8,990.

 

Also launched in India this week was the Gionee P2D Android smartphone.

IWOCL: International Workshop on OpenCL 2014



May 12 & 13 saw the annual gathering of the OpenCL™ community at the @Bristol Interactive Science Centre.  IWOCL – not to be confused with EWOKL, a presumably fictional gathering celebrating lesser-known Star Wars characters – saw 115 OpenCL parallel programming experts sharing their latest research and developments. Over the event’s two days there was a wide variety of presentations, from academic research deep-dives to high-level commercial overviews.


[Photo: 115 delegates from 14 countries in the @Bristol conference venue]


OpenCL has broad appeal and is seeing expanding use, from super-computer to mobile, serving a wide spectrum of use-cases from academic research through to hard-nosed commercial applications.  Getting to grips with the flexible technology that makes all this possible is a fascinating and complex subject.   Ultimately it’s about how you split an algorithm into (typically) a very large number of chunks and spread them across multiple processors for execution in parallel.  OpenCL’s flexibility allows all sorts of different processors to be targeted, but a common use-case leverages the parallel processing power of the GPU.  Of course the primary aim of all this endeavor has been to get things to run faster.  With mobile – though performance is still important – there’s another benefit: power efficiency.  As the demands on mobile devices increase this is becoming more and more important and the conference reflected this with a variety of mobile-related presentations.  There are also numerous projects creating compilers for new and existing languages to compile down to OpenCL code – over 70 at the last count.


[Photo: Plenty of interest in the live OpenCL and RenderScript demos on the ARM stand]


We were delighted to be invited to present a workshop in one of the parallel tracks over the afternoon of the first day.  (Parallel tracks at an OpenCL conference just seems right somehow). Johan Gronqvist and I gave an overview of the ARM® Mali™-T600 and Mali-T700 range of mobile GPUs, highlighting some of the features of the architecture and consequences for developers when optimising for Mali GPUs.  This included some deep dive case studies for both a Laplace filter and an implementation of SGEMM covering matrix multiplication.  As you might expect, there was plenty of experience in the audience and it led to some interesting and insightful questions.  It was great to see the level of interest in compute on mobile, both during our talk and whilst meeting delegates throughout the conference.


Elsewhere the recent work of Khronos – keeper of the OpenCL standard amongst others – was on display, including SPIR (Standard Portable Intermediate Representation), which will avoid having to ship OpenCL source code with applications and provide a target format for compilers of other programming languages that would benefit from an OpenCL back-end. Also on show were SYCL, an abstraction layer for leveraging C++ and OpenCL, and a keynote from Neil Trevett, chairman of the Khronos OpenCL group, about the past, present and future of the API.

 

[Photo: Neil Trevett, Chairman of the Khronos OpenCL group]


Adobe’s Eric Berdahl gave an entertaining perspective of how a commercial developer determines when to concentrate efforts on incorporating OpenCL to accelerate large media applications, and some of the pitfalls and advantages of doing so.  And there were several other talks from a wide-ranging selection of the OpenCL community along with a healthy selection of posters, including one on using OpenCL with Python from our very own Anton Lokhmotov, who was also on the IWOCL program committee.


All in all, a fascinating event and we’re very grateful to the organisers for the opportunity to represent ARM.  It was a lot of fun… roll on IWOCL 2015!


For more information:


IWOCL: http://iwocl.org/


SPIR: http://www.khronos.org/spir


SYCL: https://www.khronos.org/opencl/sycl

Mali Weekly Round-Up


ARM Launches v.4.2 of the Offline Shader Compiler

 

This latest version is capable of compiling OpenGL ES 3.0 shaders and supporting the Mali-T600 series r3p0-00rel0 driver.

 

For example, the new Mali GPU Offline Shader Compiler can be used to get statistics for the shaders in the Skybox tutorial:

 

For the vertex shader, saved on disk as skybox.vert:

#version 300 es

out vec3 texCoord;
uniform mat4 viewMat;

void main(void)
{
    const vec3 vertices[4] = vec3[4](vec3(-1.0f, -1.0f, 1.0f),
                                     vec3( 1.0f, -1.0f, 1.0f),
                                     vec3(-1.0f,  1.0f, 1.0f),
                                     vec3( 1.0f,  1.0f, 1.0f));
    texCoord = mat3(viewMat) * vertices[gl_VertexID];
    gl_Position = vec4(vertices[gl_VertexID], 1.0f);
}

We can run the compiler in verbose mode:

 

malisc -v skybox.vert -V A

 

And we will get the following output:

 

ARM Mali Offline Shader Compiler v4.2.0

(C) Copyright 2007-2014 ARM Limited.

All rights reserved.


No driver specified, using "Mali-T600_r3p0-00rel0" as default.


No core specified, using "Mali-T670" as default.


No core revision specified, using "r1p0" as default.


Compilation successful. 3 work registers used, 6 uniform registers used, spilling not used.

                A    L/S    T    Total    Bound
Cycles:         9    9      0    18       A, L/S
Shortest Path:  2    9      0    11       L/S
Longest Path:   2    9      0    11       L/S

Note: The cycles counts do not include possible stalls due to cache misses.


For more information, visit the tool page.


Samsung Galaxy S5: Powered by Exynos


A great blog by the folks at Samsung Exynos covering the Exynos 5 Octa (5422), which was announced at MWC 2014. With its big.LITTLE processing technology and ARM Mali-T628 GPU, it is a great SoC which offers both high performance and extended battery life.


Read the full blog here.

 

 

Android TV boxes: Tronsmart Vega S89 vs. Rikomagic M902

 

The Android STB market is a strong one for ARM with many SiPs choosing to implement our GPU IP in their designs. This week Liliputing delivered an in-depth comparison of two such implementations, that of the Tronsmart Vega S89 with its Amlogic S802 and ARM Mali-450 GPU and that of the Rikomagic M902 with its Rockchip RK3188 SoC and ARM Mali-400 GPU.

 

Read the full article here.


Xiaomi unveils a 4K, 3D Android-powered TV

 

Xiaomi is one of China's biggest producers of consumer electronics. The big news around their latest announcement of the Mi TV 2 is the price tag: a 4K TV for $640, featuring an ARM Mali-450 MP GPU.

 

Read more about this release here.

Mali Performance 3: Is EGL_BUFFER_PRESERVED a good thing?


This week I'm finishing off my slight diversion into the land of application framebuffer management with an analysis of EGL_BUFFER_PRESERVED, and how you determine whether it is a good technique to use. This is a question which comes up regularly when talking to our customers about user-interface development and, like many things in graphics, its efficiency  depends heavily on what you are doing, so I hope this blog makes it all crystal clear (or at least slightly less murky)!

What is EGL_BUFFER_PRESERVED?

 

As described in my previous blog, Mali Performance 2: How to Correctly Handle Framebuffers, in normal circumstances the contents of window surfaces are not preserved from one frame to the next. The Mali driver can assume that the contents of the framebuffer are discarded, and therefore it does not need to maintain any state for the color, depth, or stencil buffers. In EGL specification terms the default EGL_SWAP_BEHAVIOR is EGL_BUFFER_DESTROYED.

 

When creating a window surface via EGL it can alternatively be created with EGL_SWAP_BEHAVIOR configured as EGL_BUFFER_PRESERVED. This means that the color data in the framebuffer at the end of rendering of frame N is used as the starting color in the color buffer for the rendering of frame N+1. Note that the preservation only applies to the color buffer; the depth and stencil buffers are not preserved and their value is lost at the end of every frame.
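
For reference, the swap behavior is set per surface via standard EGL 1.4 entry points; a minimal sketch (assuming the EGLConfig advertises EGL_SWAP_BEHAVIOR_PRESERVED_BIT in its surface type, without which the request may fail):

// Request preserved-color behavior on an existing window surface
eglSurfaceAttrib( display, surface, EGL_SWAP_BEHAVIOR, EGL_BUFFER_PRESERVED );

// The current behavior can be queried back to confirm it took effect;
// EGL_BUFFER_DESTROYED is the usual default.
EGLint behavior;
eglQuerySurface( display, surface, EGL_SWAP_BEHAVIOR, &behavior );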

 

Great, I can render only what changed!

 

The usual mistake most people make is that they believe this technique allows them to patch a small amount of rendering over the existing framebuffer. If the only thing which has changed on screen since the previous frame is the clock incrementing one second, then I just have to modify the clock in the taskbar, right? Wrong!

 

Remember that most real systems are running an N-buffered rendering scheme, sometimes double-buffered, but increasingly commonly triple-buffered. The memory buffer you are appending on top of when rendering frame N+1 is not the color buffer for frame N, but probably that for frame N-2. Far from being a simple patch operation, EGL_BUFFER_PRESERVED forces the driver to render a textured rectangle containing the color buffer from frame N into the working tile memory for frame N+1.

 

As mentioned in one of my previous blogs, and covered by Sean Ellis's blog on Forward Pixel Kill (FPK), some of the more recent [1] members of the Mali GPU family have support for removal of overdrawn fragments before they become a significant cost to the GPU. In cases where overdraw on top of the previous frame is opaque (no blending, and the fragment shader does not call "discard"), the overdrawn parts of the readback can be suppressed and consequently do not have a performance or bandwidth impact. In addition, if you have EGL_BUFFER_PRESERVED enabled but find you want to overdraw everything, then you can always just insert a normal glClear() call at the start of the frame's rendering to prevent the readback happening at all.

 

Is EGL_BUFFER_PRESERVED worth using?

 

So, accepting the need for this full screen readback, which is relatively straightforward when you start thinking in terms of multi-frame rendering pipelines, the next question that I get asked is "should I use EGL_BUFFER_PRESERVED for my user interface application or not?"

 

Like many worthwhile engineering questions, the answer is not a simple "yes" or "no", but the more subtle "it depends".

 

The cost of EGL_BUFFER_PRESERVED is the full-frame load of the previous frame's data (excepting that killed by FPK) to populate the frame with the correct starting color. The alternative is re-rendering the frame from scratch, starting from the clear color. Whether using EGL_BUFFER_PRESERVED is the right thing to do therefore depends on the relative cost of these two things.

 

  • If your UI application is compositing multiple uncompressed layers which make heavy use of transparencies, then using EGL_BUFFER_PRESERVED is probably a sensible thing to do. The cost of one single layer of readback of the previous color data will be less expensive than recreating the color from scratch via the multi-layer + blending route.
  • If you have a simple UI or 2D game which is predominantly single layer, reading from compressed textures, then EGL_BUFFER_PRESERVED is very likely to be the wrong thing to do. The bandwidth overheads of the readback of the previous frame's color will be more expensive than recreating the frame from scratch.

 

It is obviously not always as clear cut as this — there are shades of grey between these two extremes — so care is needed when performing any analysis. If in doubt, use the GPU performance counters to review the performance of your real application running in place on your production platform, with and without EGL_BUFFER_PRESERVED enabled. Nothing will give a better answer than measuring your real use case on a real device. Some of the other blogs in this series provide guidance on such application performance analysis, and I’ll be continuing to add more material in this area over the coming months.

 

However, when performing such performance experiments, it is important to note that the best applications are designed explicitly to work with (or without) EGL_BUFFER_PRESERVED; it is not normally as simple as just flicking an EGL configuration switch if you want to get the most efficient solution out of either route.

It is also worth noting that in a system with both an ARM Frame Buffer Compression (AFBC) enabled display controller, such as Mali-DP500, and an AFBC-enabled GPU, such as Mali-T760, the bandwidth overheads of the EGL_BUFFER_PRESERVED blit readback can be significantly reduced, as the readback bandwidth will be that of the compressed framebuffer, which is typically 25-50% smaller than the uncompressed original.

 

A Better Future?

 

The behavior of EGL_BUFFER_PRESERVED is a nice idea, and in many cases still useful, but many of the theoretical advantages of it are lost in N-buffered systems due to the need to insert this readback of the previous frame's data.

 

We believe that applications — user interfaces in particular — could be made significantly more efficient if both the application and the buffer preservation schemes available explicitly expose (and can therefore exploit) the N-buffered memory model on a particular platform.  If the application knows that the system is double buffered, and it knows the delta between the current state and the state two frames ago, then it is possible to get close to the architectural ideal of only modifying the regions in memory which have changed. This has the potential to reduce the energy consumption and memory bandwidth radically for mostly steady-state user-interfaces.

 

We can't share any technical details on this yet - but watch this space!

 

Tune In Next Time

 

This brings me to the end of my short diversion on framebuffer management, so next time we'll be back looking at using Mali with ARM DS-5 Streamline to investigate application performance bottlenecks and optimization opportunities.

 

TTFN,
Pete

 


 

Footnotes

  1. FPK is supported from Mali-T620 onwards

 


Pete Harris is the lead performance engineer for the Mali OpenGL ES driver team at ARM. He enjoys spending his time working on a whiteboard and determining how to get the best out of combined hardware and software compute sub-systems. He spends his working days thinking about how to make the ARM Mali drivers even better.

Mali Weekly Round-Up: Cadence and ARM Expand Collaboration to 64-bit


ARM and Cadence EDA Technology Access Agreement

 

An EDA Technology Access Agreement announced on Tuesday gives Cadence access to ARMv7 and ARMv8-A architecture-based processor IP, ARM® Mali™ GPUs, System IP and physical libraries to enable tools optimized for these IPs and the development of designs achieving the required power, performance and area (PPA). The tighter integration of Cadence's tools with ARM technology will make SoC design, verification and implementation easier and help to shorten the time to market.

 

This collaboration will provide world-class technology for energy efficient and high performance applications for mobile, consumer, networking, storage, automotive and other end market products. As our Executive Vice President and President of Product Groups, Pete Hutton, said, "We are dedicated to empowering developers, designers and engineers to innovate around ARM technology and ensuring a fast, reliable route to market".

 

For coverage on the story, the full PR is available on Cadence's website.

 

SONIQ Launch Three SmartTVs Customized for the Chinese Market

 

Having announced last month that they were moving into the Chinese market, SONIQ released three SmartTVs specially designed for Chinese households which, having been built in conjunction with China's largest internet security companies, are billed as "the first Internet-safe TVs" suitable for the home. All based on a dual-core Cortex®-A9 CPU and a built-in Mali-400 GPU, they offer energy efficiency, high-definition image quality and high value for money, as well as a strong range of apps.

 

Discover the full PR at Market Watch.

 

Doogee's Solutions for Mid-Range Mobiles are Put to the Test

 

Doogee is a relatively unknown Chinese brand, but it is making strong headway in the rapidly expanding local entry-level smartphone market. The Android Authority got its hands on two of Doogee's 2014 offerings this week, the Doogee Turbo DG2014 and the Doogee Pixels DG350. Both phones are priced under £100, boast a MediaTek MT6582 SoC with a quad-core ARM Cortex-A7 CPU and dual-core ARM Mali-400 GPU, and perform well for their price range in benchmarking tests. The general conclusion is that competition in the entry-level market place is picking up, and you can definitely now get a good deal of mobile computing for your money.

 

New Mali-based Mini-PC on the Market

 

Equipped with a Rockchip RK3188 processor with an ARM Mali-400 GPU, the Cloudsto EVO runs Ubuntu Linux 12.04 and is a good option for those looking for a desktop device which offers the functionality of a PC (such as internet browsing, email reading, file management, photo editing, document creation and video watching) in a size which fits into the palm of your hand and costs roughly the same as an external hard drive! Alternatively, it is a great, energy-efficient solution for use cases such as digital signage.

 

See more at Geeky Gadgets or Liliputing.

Optimizing Fast Fourier Transformation on ARM Mali GPUs


Fast Fourier Transformation (FFT) is a powerful tool in signal and image processing. One very valuable optimization technique for this type of algorithm is vectorization. This article discusses the motivation, vectorization techniques and performance of FFT on ARM® Mali™ GPUs. For large 1D FFT transforms (greater than 65536 samples), performance improvements of over 7x are observed.

 

Background

A few months ago I was involved in a mission to optimize an image processing app. The application had multiple 2D convolution operations using large-radius blurring filters. Image convolutions map well to the GPU since they are usually separable and can be vectorized effectively on Mali hardware. However, with a growing filter size, they eventually hit a brick wall imposed by computational complexity theory.

[Figure: An illustration of a 2D convolution. Source: http://www.westworld.be/page/2/]


In the figure above, an input image is convolved with a 3x3 filter. For each output pixel, 9 multiplications and 8 additions are required. When estimating the time complexity of this 2D convolution, the multiplication and addition operations per pixel are assumed to take constant time, so the time complexity is approximately O(n²). However, when the filter size grows, the number of operations per pixel increases and the constant-time assumption can no longer hold. With a non-separable filter, the time complexity quickly approaches O(n⁴) as the filter size becomes comparable to the image size.
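
To make the operation count concrete, a minimal direct 2D convolution in C looks like this; each output pixel costs S*S multiply-accumulates, so the total work is O(W * H * S * S):

// Direct (non-separable) 2D convolution with an SxS filter; border pixels
// are skipped for brevity.
void convolve2d( const float *src, float *dst, int w, int h,
                 const float *filter, int s )
{
    int r = s / 2;
    for (int y = r; y < h - r; y++)
    {
        for (int x = r; x < w - r; x++)
        {
            float acc = 0.0f;
            for (int fy = -r; fy <= r; fy++)
                for (int fx = -r; fx <= r; fx++)
                    acc += src[(y + fy) * w + (x + fx)] *
                           filter[(fy + r) * s + (fx + r)];
            dst[y * w + x] = acc;
        }
    }
}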

 

In the era of ever increasing digital image resolutions, O(n⁴) time is simply not good enough for modern applications. This is where FFT may offer an alternative computing route. With FFT, convolution operations can be carried out in the frequency domain. The FFT forward and inverse transformations each need O(n² log n) time, a clear advantage over direct convolution in the time/spatial domain, which requires O(n⁴).


The next section assumes basic understanding of FFT. A brief introduction to the algorithm can be found here:

http://www.cmlab.csie.ntu.edu.tw/cml/dsp/training/coding/transform/fft.html

 

FFT Vectorization on Mali GPUs

For simplicity, a 1D FFT vectorized implementation will be discussed here. Multi-dimensional FFTs are separable operations, thus the 1D FFT can easily be extended to accommodate higher-dimension transforms. The information flow of FFT is best represented graphically by the classic butterfly diagram:

[Figure: 16-point Decimation in Frequency (DIF) butterfly diagram.]


The transformation is broken into 4 individual OpenCL™ kernels: 2-point, 4-point, generic and final transformations. The generic and final kernels can vary in size: the generic kernel handles transformations from 8-point up to half of the full transformation size, and the final kernel completes the transformation by computing the full-size butterfly.

 

FFT operates within the complex domain. The input data is sorted into a floating point buffer of real and imaginary pairs:

[Figure: The structure of the input and output buffer. A complex number consists of two floating point values; the vector width is also shown.]


The first stage of the decimation in time (DIT) FFT algorithm is a 2-point discrete Fourier transform (DFT). The corresponding kernel consists of two butterflies, each of which operates on two complex elements as shown:

 

[Figure: The first stage kernel. A yellow-grey shaded square represents a single complex number; the yellow outline encloses the butterflies evaluated by a single work item. The same operation is applied to cover all samples.]


Each work item has a throughput of 4 complex numbers, i.e. 256 bits, which aligns well with the vector width.
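
A hedged OpenCL sketch of such a first-stage kernel (not the actual implementation), assuming the input has already been reordered so that each butterfly's two complex inputs are adjacent, so one float4 load fetches both:

__kernel void fft_stage_2pt( __global const float4 *input,
                             __global float4 *output )
{
    size_t gid = get_global_id( 0 );

    // Two butterflies per work item: 4 complex numbers, 256 bits in and out.
    float4 a = input[2 * gid];        // Complex c0 in .xy, c1 in .zw
    float4 b = input[2 * gid + 1];    // Complex c2 in .xy, c3 in .zw

    // 2-point DFT: out0 = in0 + in1, out1 = in0 - in1 (the twiddle is 1).
    output[2 * gid]     = (float4)( a.xy + a.zw, a.xy - a.zw );
    output[2 * gid + 1] = (float4)( b.xy + b.zw, b.xy - b.zw );
}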

In the second stage, the butterflies have a size of 4 elements. Similar to the first kernel, the second kernel has a throughput of 4 complex numbers, aligning with the vector width. The main distinctions are in the twiddle factors and the butterfly network:

[Figure: The second stage kernel: a single 4-point butterfly.]


The generic stage is slightly more involved. In general, we would like to:

  • Re-use the twiddle factors
  • Keep the data aligned to the vector width
  • Maintain a reasonable register usage
  • Maintain a good ratio between arithmetic and data access operations

 

These requirements help to improve the efficiency of memory access and ALU usage. They also help to ensure that an optimal number of work items can be dispatched at a time. With these requirements in mind, each work item for this kernel is responsible for 4 complex numbers in each of 4 butterflies. The kernel essentially operates on 4 partial butterflies and has a total throughput of:

 

4 complex numbers * 2 floats * 32 bits * 4 partial butterflies = 1024 bits per work item

 

This is illustrated in the following graph:

 

 

[Figure: The 8-point butterfly case for the generic kernel. The left side of the diagram shows the 4 butterflies which associate with a work item; the red boxes highlight the complex elements being evaluated by that work item. The drawing on the right is a close-up view of a butterfly, with the red and orange lines highlighting the relevant information flow.]


Instead of evaluating a single butterfly at a time, the kernel works on portions of multiple butterflies. This essentially allows the same twiddle factors to be re-used across the butterflies. For an 8-point transform as shown in the graph above, the butterfly would be distributed across 4 work items. The kernel is parameterizable from an 8-point up to an N/2-point transform, where N is the total length of the original input.

 

For the final stage, only a single butterfly of size N exists, so twiddle factor sharing is not possible. The final stage is therefore just a vectorized butterfly network that is parameterized to a size of N.

 

Performance

The performance of the 1D FFT implementation described in the last section is compared to a reference CPU implementation. In the graph below, the relative performance speed-up is shown for sizes from 2⁶ to 2¹⁷ samples. Please note that the x-axis is on a logarithmic scale:

 

[Figure: GPU FFT performance gain over the reference implementation.]

 

We have noticed in our experiments that FFT performance on the GPU tends to improve significantly between about 4096 and 8192 samples. The speed-up continues to improve as the sample size grows; the performance gain essentially offsets the setup cost of OpenCL for large samples. This trend would be more prominent in higher-dimension transformations.

 

Summary

FFT is a valuable tool in digital signal and image processing. The 1D FFT implementation presented can be extended to higher dimension transformations. Applications such as computational photography, computer vision and image compression should benefit from this. What is more interesting is that the algorithm scales well on GPUs. The performance will further improve with more optimization and future support of half-float.

 

 

 


Mali Weekly Round-Up: From Supercomputers to Entry-Level


Europe to use ARM CPUs and GPUs to make an exaflop supercomputer 30-50x more energy efficient than the best supercomputers today

 

In 2011 the Mont Blanc project announced that it could design a higher performing, more energy efficient standard of computer architecture by using the processor technology found in today's embedded and mobile devices. Using the Exynos 5 Dual with its ARM® Mali™-T604 GPU as the basis of their prototype, their new machines are predicted to be able to carry out ten to the power of eighteen operations a second whilst being fifteen to thirty times more energy efficient than the systems used today and are set to completely revolutionize HPC technology.

 

Next Big Future covered the history of the project to date in this article.

 

New smartphone market data shows strong growth for entry-level and Android OS-based devices

 

Worldwide smartphone sales will reach 1.2bn by the end of 2014, an increase of 23.1% over the previous year, according to the latest report from IDC Research, and it is the entry-level market in emerging countries such as India, Indonesia and Russia that is especially drawing the attention of industry analysts. Average selling prices are starting to decrease with a smartphone now expected to sell at $314, but they are also offering far better value for this price as premium technology from previous years becomes affordable to the mass market. In addition, Android is set to continue its leadership, hitting an 80.2% market share by the end of 2014. This is encouraging news for Mali, whose GPU IP is found in over 50% of Android tablets and over 20% of all Android smartphones.

 

This week's Mali-based device launches

 

This week saw the launch of the Vodafone Smart Tab 4 into the UK market, an 8 inch tablet with a MediaTek MT8382 processor featuring a quad core ARM Cortex®-A7 CPU and a Mali-400 GPU. Acer announced four new Mediatek based tablets, featuring ARM Mali GPUs, set to be launched in the third quarter of this year. In addition the Alcatel OneTouch Idol X+ and the Wickedleak Wammy Neo with their Mali-450 GPUs were launched into the Indian market and the Huawei Honor 3C with its Mediatek MT6592 was launched in Pakistan.

From Keyboards to Touchscreens To....? The ‘Futuristic’ HMIs That Will Soon Be A Reality In Your Pocket.


Human Machine Interfaces (HMIs) are an incredibly important part of consumer electronics.  A machine that is clumsy and unintuitive to interact with will rarely be a great success. As a result of this, if you look back over the past ten to fifteen years, it’s been one of the leading areas of innovation for the industry. Phones have evolved from having area-consuming PC-like keyboards to soft-control touchscreens and this change enabled larger screens and a better multimedia experience on pocket-sized mass-market mobile devices. In fact, the simplicity of the touchscreen has become so popular that it’s been adapted into many other areas where an easy to use natural user interface (NUI) is key, such as automotive multimedia systems or advanced medical applications.

 

In its simplest form, HMI can be defined as “how a user interacts with their device”. However, innovations in HMI have much deeper influences than that. We can see that advancements in the area of HMI have not only changed how we interact with devices, but also what those devices do for us. It has been a key influencer in unlocking new functionality within our devices and what they mean in our lives. Phones are no longer used simply as a means of interchanging messages. We can now monitor our health, check the weather, play a vast variety of games on them, draw and edit pictures, or surf the internet at ease. The evolution of the HMI goes hand-in-hand with the evolution of a device’s multimedia content.

 

Today, richly graphical user interfaces, touchscreens and soft controls are the norm. To enable this, processors have had to evolve to offer the increased computational performance required of the device. From a graphical interface perspective, you can see GPU development has been driven by three key areas of demand:

 

  1. Increasing resolutions: Since the Google Nexus 10 exceeded full HD 1080p resolution with its 2560x1600 (WQXGA) screen, OEMs have continued to increase pixel density, setting 4K (UHD) resolution as the new goal for mobile silicon. To enable this, ARM is not only increasing the number of potential shader core implementations within each GPU (the ARM® Mali™-T760 GPU can host up to sixteen) but also improving the internal efficiency of the cores themselves and of the memory access, in order to ensure that the scaling of cores results in a proportional scaling of performance.
  2. Diversity of screen sizes:  GPUs are suitable not only for HD tablets and UHD DTV displays but also for smaller, wearable devices and IoT applications. The growing diversity of consumer devices is encouraging semiconductor companies to deliver a correspondingly diverse GPU IP portfolio: a processor suitable for any application. With GPUs ranging from the area-efficient Mali-300 to the high-performing, sixteen core ARM Mali-T760 this is exactly what ARM is offering and we are continuing to evolve our roadmaps to deliver great graphical experiences on any device.
  3. Hardware support for more complex content: as NUIs become increasingly life-like, hardware support for features such as the latest APIs becomes crucial in order to enable graphics-rich, intuitive content with smooth frame rates. Not only that, but the raw computational power needed in order to produce these smooth, life-like images that are expected in high end devices puts ever increasing demands on the capabilities of the processors in them.  Again, that’s where the efficiency of ARM CPUs and GPUs come into play.  Coupled with the configurability and scalability of ARM processors, device manufacturers have the flexibility they need to meet consumer demands, cost efficiently, across the entire market.

 

I believe the current phase of HMI is still being explored and will continue to see significant innovation. In the world of battery-powered devices, traditional PC games have been adapted from console and controller platforms to mobile. With this shift you can see some developers mimicking console controllers with the touchscreen, whilst others have achieved success with new, simple interfaces tailored to the nature of the game (such as swipes, tilts, etc.). This success is inspiring more developers either to design new applications for these effective HMIs, or to create new HMIs tailored to their new games, making the entire multimedia experience ever more intrinsically interactive instead of conforming to traditional HMI methods.

 

However, that’s all happening today. What really excites me is what we can see coming in the future.

 

Across nearly all the evolutions in NUI, you can see a desire and a trend for effortless, instinctive interaction. Physical push buttons have given way to soft buttons; fixed-function devices have had their functionality opened up by a range of application-dependent, software-driven controls. Looking into the near future, I can see the next phase of HMI arriving in the form of our devices “reaching out” to interact with us. Why should I have to remember an easily copied PIN sequence to unlock my device? Why can’t ‘I’ be the key? This trend is at the beginning of its lifecycle, with facial recognition capabilities becoming standard in mobile devices and starting to be used for unlocking phones. As another example, why do we still have to find controllers every time we wish to change the channel or volume on the TV? Why can’t we control TVs ourselves via gesture or voice control? Why can’t the TV control itself, reaching out to see if anyone is watching it or whether the content is suitable for the audience (for example, if there are children in the room)? As eyeSight’s CEO, Gideon Shmuel, says:

 

“In order for an interaction solution to be a true enhancement compared to existing UIs it must be simple, intuitive and effortless to control – and to excel even further, an interaction solution should become invisible to the user. This enhanced machine vision understanding will, in fact, deliver a user aware solution that understands and predicts the desired actions, even before a deliberate command has been given.”

 

The concepts for these new HMIs have existed for a while, but it is only in the past year that technology has started to catch up and provide the desired results within the restricted mobile power budget. In most cases, a device that is “reaching out” to the user is using gesture recognition, motion detection or facial recognition. Two issues had been holding this back. Firstly, the processing budget for UI in embedded and mobile devices was not sufficient to support these pixel-intensive, computationally demanding tasks. Advancements such as GPU Compute, OpenCL™ and ARM big.LITTLE™ processing are addressing this issue, increasing the amount of processing possible within the same time budget, and several companies are seeing success in these areas.

 

[Video: interview with eyeSight]


Secondly, I believe that the lack of a flexible, adaptable platform on which these tasks can be developed and matured was holding back this technology. However, now devices with performance-efficient GPU Compute technology are entering the market, such as the recently released Galaxy Note 3, and ARM is seeing an explosion in the number of third parties and developers exploring ways in which this new functionality can bring their innovations to life.

 

Looking even further ahead, it is clear that HMI will become even more complex as machines start to “reach out” to each other as well as to their users. As devices continue to diversify, I believe we will see a burst of innovation in how these devices start to interact and be used, either in conjunction with or interchangeably with each other. As the Internet of Things picks up pace, the conversation will be about HMMI rather than simply HMI; then HMMMI. How will we interact with our devices when all our devices are connected? If my smartphone senses my hands are cold, will it automatically turn the room or car heating up? If I leave my house but accidentally leave the lights on, will they turn themselves off? Will the advancements in NUI on our mobile devices make interactions on less capable devices obsolete? Will we even need mobile devices as an interface to the machine world, or will every device with a processor be able to “reach out” to its environment? The possibilities are vast in a user-aware world, and ARM’s role in this area will continue to be to develop the processor IP that enables continuous, ground-breaking innovation.

 

What are your thoughts on the future of NUI? What will ARM have to do to meet its future needs? Let us know your thoughts in the comments below.

ARM submits conformance for OpenGL ES 3.1


This year at GDC, Khronos announced the latest version of the OpenGL® ES API. OpenGL ES 3.1 takes a step up from OpenGL ES 3.0 to enable new, fascinating mobile graphics content. With headline features such as compute shaders and indirect drawing, described in detail by Tom Olson, chair of the OpenGL ES Working Group, in his very interesting blog Here comes OpenGL® ES 3.1!, application developers can now use this new API to deliver an even higher quality of graphics within the power constraints of mobile platforms. Here at ARM, we are fully committed to enabling our GPUs with the latest graphics and GPU Compute APIs as soon as possible. Today, less than three months after the official OpenGL ES 3.1 announcement, Khronos finalised the conformance criteria and ARM is submitting for OpenGL ES conformance.

 

Conformance has just been submitted for the highly successful and market-proven Mali-T604 and Mali-T628 GPUs, as well as for the latest released high-end GPU, the Mali-T760. The first two power the graphics capabilities of best-selling products such as the Samsung Galaxy S5, Galaxy Note 3, Google Nexus 10 and Galaxy Note Pro 12.2, while the Mali-T760 is expected to become available in commercial products within the next few months. Conformance will soon be submitted for our latest mid-range GPU, the Mali-T720, as well.

 

One of the key features of OpenGL ES 3.1 is support for compute shaders. Developers can now use the compute capabilities of the GPU without having to use a separate compute API and worry about the interoperability between graphics and compute. Seamlessly integrated into a single API, compute shaders can post-process the framebuffer output and implement astonishing visual effects with higher efficiency and lower complexity. It is also worth mentioning here that ARM has embraced GPU Compute from its very first steps and is creating a vibrant ecosystem of developers who are providing a number of innovative applications for Mali GPUs and establishing them as the de facto architecture for mobile GPU Compute.

 

A very good example of the life-like effects that can be implemented using the horsepower of OpenGL ES 3.1 running on Mali GPUs can be seen in the video below. In this demo, you can see advanced physics simulation reflected in the motion of a hanging piece of cloth as it is blown by variously shaped objects:

 

 

For interested readers, the blog Get started with compute shaders, written by Sylwester Bala, one of ARM’s Senior Demo Developers, provides a complete background to this demo.

 

OpenGL ES 3.1 is backwards compatible with OpenGL ES 3.0 and 2.0, making sure that the developer’s investment is protected, while a new set of features is provided, such as enhanced texturing functionality that includes texture gather, multisample textures and stencil textures. Texture gather allows faster access to neighbouring texels, while multisample and stencil textures give applications the same flexibility in texture processing as in render targets. These extra texture processing features enable crystal-clear graphics to be displayed smoothly on a high-resolution screen much more efficiently, which means longer battery life for mobile devices without any compromise in quality. Moreover, the enhanced shading language provides more built-in functions to developers, making their lives simpler and increasing their productivity.

 

ARM is one of the first Khronos members to submit conformance for OpenGL ES 3.1 and we are dedicated to supporting our customers and ecosystem partners with the latest and greatest features that graphics technology has to offer. The power and flexibility of the Midgard architecture ensure our partners and developer ecosystem are always enabled with cutting edge technology that delivers best in class graphics within the tight power and area budget required for mobile devices.

Mali Weekly Round-Up: Two New Mali Processors and Conformance Submission


The octa-core Huawei Kirin 920 chipset goes official

 

Today, Huawei launched an impressive new SoC housing four ARM® Cortex®-A15 CPUs clocked between 1.7 and 2.0GHz and four Cortex-A7 cores clocked between 1.3 and 1.6GHz in a big.LITTLE configuration. Designed for the high-end superphone market, this latest SoC promises to offer an incredible user experience thanks to its powerful quad-core Mali-T628 GPU, which is capable of breathtaking graphical displays, 3D gaming, visual computing, augmented reality, procedural texture generation and voice recognition.

 

GSM Arena covers the news in this article.

MediaTek announces the MT8127 SoC for quad-core tablets

 

At the end of last week, MediaTek announced a new chip specially designed to bring advanced multimedia features, outstanding performance and low power consumption to the super-mid market at an agreeable price point. The MT8127 SoC features a quad-core ARM Cortex-A7 processor clocked at 1.5GHz along with a quad-core ARM Mali-450 GPU to enable seamless Full HD video playback. The announcement also included information on a future MT8127-powered device, the ALCATEL ONETOUCH PIXI 8 tablet.

 

The full press release is available on MediaTek's website.

 

ARM submits conformance for OpenGL ES 3.1

 

Also today, the Khronos Group finalized the conformance criteria for the latest version of the OpenGL® ES API, OpenGL ES 3.1. ARM has already submitted conformance for three of its GPUs: the ARM Mali-T604, Mali-T628 and Mali-T760. For full information on this announcement, read Plout Galatsopoulos' blog ARM submits conformance for OpenGL ES 3.1.

 

Any new devices launched?

 

This week saw the launch of the Xolo Q1200 smartphone, the next in the series from the Indian smartphone brand. It comes with some cool new apps including gesture controls, voice recognition, float task with dual window feature, cold access apps, and smart reading mode, all powered by a 1.3GHz quad-core MediaTek MT6582 processor with a dual-core Mali-400 MP GPU.

 

More information on the launch can be found in this article.

Interested in GPU Compute? You Have Choices!


The most notable addition to OpenGL® ES when version 3.1 was announced at GDC earlier this year was Compute Shaders. Whilst similar to vertex and fragment shaders, Compute Shaders allow much more general-purpose data access and computation. They have been available in desktop OpenGL® since version 4.3 in mid-2012, but this is the first time they have been available in the mobile API. This brings another player to the compute-on-mobile-GPU game, joining the ranks of OpenCL, RenderScript and others. So what do these APIs do, and when should you use them? I’ll attempt to answer these questions in this blog.

 

When it comes to programming the GPU for non-graphics jobs, the various tools at our disposal share a common goal: to provide an interface between the CPU and GPU so that packets of work to be executed in parallel can be applied to the GPU’s compute resources. Designing tools that are flexible enough to do this, and that allow the individual strengths of the GPU’s architecture to be exploited, is a complex process. The strength of the GPU is to run small tasks across a wide range of data, as far as possible in parallel, often many millions of times – this is, after all, what a GPU does when processing pixels. Compute on the GPU simply generalizes this capability. So inevitably there are some similarities in how these tools do what they do.

 

Let’s take a look at the main options…

 

OpenCL



Initially developed by Apple and subsequently managed by the Khronos Group, the OpenCL specification was released in late 2008. OpenCL is a flexible framework that can target many types of processor, from CPUs and GPUs to DSPs. To do so you need a conformant OpenCL driver for the processor you’re targeting. Once you have that, a properly written OpenCL application will be compatible with other suitably conformant platforms.

 

When I said OpenCL is flexible, I was perhaps understating it. Based on a variant of C99, it is very flexible, allowing complex algorithms to be shaped across a wide variety of parallel computing architectures. And it has become very widespread – there are drivers out there for hundreds of platforms. See this list for the products that have passed the Khronos conformance tests. ARM supports OpenCL with its family of ARM® Mali™ GPUs; for example, the Mali-T604 passed conformance in 2012.

 

So is there a price for all this flexibility? Well, it can be reasonably complex to set up an OpenCL job… and there can be quite an overhead in doing so. The API breaks down access to OpenCL-compatible devices into a hierarchy of sub-units.

 

[Figure: the OpenCL execution model – host, devices, compute units and processing elements]

 

So the host computer can in theory have any number of OpenCL devices. Each of these can have any number of compute units and in turn, each of these compute units can have any number of processing elements. OpenCL workgroups – collections of individual threads called work items – run on these processing elements. How all of this is implemented is platform dependent as long as the end result is compliant with the OpenCL standard. As a result, the boilerplate code to set up access to OpenCL devices has to be very flexible to allow for so many potential variations, and this can seem significant, even for a minimal OpenCL application.
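
To make the shape of that boilerplate concrete, here is a minimal, illustrative sketch in C of the setup sequence – platform, device, context, queue, program, kernel, enqueue. The `scale` kernel is a hypothetical example of mine rather than anything from an SDK, and all error checking is omitted for brevity:

    #include <CL/cl.h>

    /* A trivial kernel: each work item scales one element of the buffer. */
    static const char *kKernelSrc =
        "__kernel void scale(__global float *data, float factor) {\n"
        "    data[get_global_id(0)] *= factor;\n"
        "}\n";

    void run_scale(float *host_data, size_t n)
    {
        cl_platform_id platform;
        cl_device_id   device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        cl_context       ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue q   = clCreateCommandQueue(ctx, device, 0, NULL);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &kKernelSrc, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel kernel = clCreateKernel(prog, "scale", NULL);

        /* Copy the host data into a device buffer. */
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    n * sizeof(float), host_data, NULL);
        float factor = 2.0f;
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
        clSetKernelArg(kernel, 1, sizeof(float), &factor);

        /* One work item per element; let the driver pick the workgroup size. */
        clEnqueueNDRangeKernel(q, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, n * sizeof(float), host_data,
                            0, NULL, NULL);

        clReleaseMemObject(buf);
        clReleaseKernel(kernel);
        clReleaseProgram(prog);
        clReleaseCommandQueue(q);
        clReleaseContext(ctx);
    }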

 

There are some great samples and a tutorial available in the ARM Mali OpenCL SDK, with a mix of basic through to more complex examples.

 

From the earliest days of OpenCL targeting mobile GPUs, the API has shown great promise, both in terms of accelerating performance and in reducing energy consumption. Many of the early examples have concentrated on image and video processing; for one, see this great write-up of the latest software VP9 decoder from Ittiam.

 

 

For more examples of some of the developments using OpenCL on mobile, check out Mucho GPU Compute, amigo! from Roberto Mijat.

 

One of the real benefits of OpenCL, as well as its flexibility, is the huge range of research and developer activity surrounding the API. There are a large number of other languages – more than 70 at the last count – that compile down to OpenCL, easing its use and allowing its benefits to be harnessed in a more familiar environment. And there are several CL libraries and numerous frameworks exposing the OpenCL API from a wide range of languages. PyOpenCL, for example, provides access to OpenCL via Python; see Anton Lokhmotov's blog on this subject, Introducing PyOpenCL.


Because of the required setup and overhead, building an OpenCL job into a pipeline is usually only worth doing when the job is big enough – at the point where this overhead becomes insignificant against the work being done. A great example of this was Ittiam Systems' recent optimisation of their HEVC and VP9 software video decoders. As not all of the algorithm was suitable for the GPU, Ittiam had to choose how to split the workload between the CPU and GPU. They identified the motion estimation part of the algorithm as the most likely to present enough parallel computational work to benefit from running on the GPU. The algorithm as a whole is then implemented as a heterogeneous split between the CPU and GPU, with the resulting benefits of reduced CPU workload and reduced overall power usage. See this link for more about Ittiam Systems. Like most APIs targeting a wide range of architectures, optimisations you make for one platform might need to be tweaked on another, but having the flexibility to address the low-level features of a platform to take full advantage of it is one of OpenCL's real strengths.

 

Recent Developments


It’s been a busy year so far for Khronos and OpenCL – there have been a number of developments.  Of particular note perhaps is the announcement of version 1.0 of WebCL™, an API that does for OpenCL what WebGL™ does for OpenGL ES by exposing the compute API to JavaScript and bringing compute access into the world of the browser. Of course, support within browsers may take some time – as it did for WebGL – but it’s a sign of OpenCL broadening its appeal.


OpenCL Summary


OpenCL provides an industry standard API that allows the developer to optimise for a supporting platform’s low level architectural features. To help you get going there is a large and growing number of developer resources from a very active community. If the platform you’re planning to develop for supports it, OpenCL can be a powerful tool.

 

 

RenderScript


RenderScript is a proprietary compute API developed by Google. It has been an official part of the Android™ OS since the Honeycomb release in 2011. Back then it was intended as both a graphics and a compute API, but the graphics part has since been deprecated. There are several similarities with OpenCL: it is based on C99, it has the same concept of organising data into 1, 2 or 3 dimensions, and so on. For a quick primer on RenderScript, see GPU Computing in Android? With ARM Mali-T604 & RenderScript Compute You Can! by Roberto Mijat, or Google's introduction to RenderScript here.

 

The process of developing for RenderScript is relatively straightforward. You write your RenderScript C99-based code alongside the Java that makes up the rest of your Android application. The Android SDK creates some additional Java glue to link the two together and compiles the RenderScripts themselves into bitcode, an intermediate, device-independent format that is bundled with the APK. When the application runs, Android determines which RenderScript-capable devices are available and able to run the bitcode in question. This might be a GPU (e.g. Mali-T604) or a DSP. If one is found, the bitcode is passed on to a driver that creates the appropriate machine-level code. If there is no suitable device, Android falls back to running the RenderScript on the CPU.
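
As an illustration of what the script side looks like, here is a minimal, hypothetical .rs kernel (the language is C99-based; the file name and package name below are assumptions of mine, not from any real project). The SDK generates the Java glue class that launches it over an Allocation:

    /* invert.rs – a minimal, hypothetical RenderScript kernel. */
    #pragma version(1)
    #pragma rs java_package_name(com.example.rsdemo)  /* hypothetical package */

    /* Invoked once per element, in parallel, across the input Allocation. */
    uchar4 RS_KERNEL invert(uchar4 in) {
        uchar4 out = in;
        out.r = 255 - in.r;  /* invert each colour channel... */
        out.g = 255 - in.g;
        out.b = 255 - in.b;
        return out;          /* ...and pass alpha through unchanged */
    }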

 

In this way RenderScript is guaranteed to run on just about any Android device, and even with fallback to the CPU it can provide a useful level of acceleration. So if you are specifically looking for compute acceleration in Android, RenderScript is a great tool.

 



The very first device with GPU-accelerated RenderScript was Google’s Nexus 10, which used an SoC featuring an ARM Mali-T604 GPU. Early examples of RenderScript applications have shown a significant benefit from using accelerated GPU compute.

 

As a relatively young API, RenderScript know-how and examples are not as easy to come by as they are for OpenCL, but this is likely to improve. There's more detail about how to use RenderScript here.

 

RenderScript Summary


RenderScript is a great way to benefit from accelerated compute in the vast majority of Android devices. Whether this compute is put onto the GPU or not will depend on the device and availability of RenderScript GPU drivers, but even when that isn’t the case there should still be some benefit from running RenderScripts on the CPU. It’s a higher-level API than OpenCL, with fewer configuration options, and as such can be easier to get to grips with, particularly as RenderScript development is streamlined into the existing Android SDK. If you have this setup, you already have all the tools you need to get going.

 

Compute Shaders




So to the new kid on the block: OpenGL ES 3.1 compute shaders. If you're already used to vertex and fragment shaders in OpenGL ES, you'll fit right in with compute shaders. They're written in GLSL (OpenGL Shading Language) in pretty much the same way, with similar use of uniforms and other properties, and have access to many of the same types of data, including textures, image types, atomic counters and so on. However, unlike vertex and fragment shaders, they're not built into the same program object and as such are not part of the same rendering pipeline.

 

Compute shaders introduce a new general-purpose form of data buffer, the Shader Storage Buffer Object, and mirror the ideas of work items and workgroups used in OpenCL and RenderScript. Other additions to GLSL allow work items to identify their position in the data set being processed and allow the programmer to specify the size and shape of the workgroups.
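
Putting those pieces together, here is a minimal sketch (in C, with the GLSL embedded as a string) of a compute shader that doubles every float in a Shader Storage Buffer Object. It assumes an OpenGL ES 3.1 context is current and that a buffer `ssbo` holding 1024 floats has already been created; error checking is omitted:

    /* A compute shader with a 64-wide workgroup operating on an SSBO. */
    static const char *kComputeSrc =
        "#version 310 es\n"
        "layout(local_size_x = 64) in;\n"              /* workgroup shape      */
        "layout(std430, binding = 0) buffer Data {\n"  /* the SSBO             */
        "    float values[];\n"
        "};\n"
        "void main() {\n"
        "    uint i = gl_GlobalInvocationID.x;\n"      /* work item's position */
        "    values[i] *= 2.0;\n"
        "}\n";

    GLuint shader = glCreateShader(GL_COMPUTE_SHADER);
    glShaderSource(shader, 1, &kComputeSrc, NULL);
    glCompileShader(shader);

    GLuint program = glCreateProgram();    /* note: no vertex/fragment stages */
    glAttachShader(program, shader);
    glLinkProgram(program);

    glUseProgram(program);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo);
    glDispatchCompute(1024 / 64, 1, 1);    /* 1024 elements, 64 per group     */
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);  /* make writes visible   */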

 

You might typically use a compute shader in advance of the main rendering pipeline, using the shader’s output as another input to the vertex or fragment stages.

 

[Figure: a compute shader feeding the rendering pipeline]

 

Though not part of the rendering pipeline itself, compute shaders are typically used to support it. They're not as well suited to general-purpose compute work as OpenCL or RenderScript – but assuming your use case is suitable, compute shaders offer an easy way to access general-purpose computing on the GPU.

 

For a great introduction to Compute Shaders, do see Sylwester Bala's recent blog Get started with compute shaders.

 

 

Compute Shaders Summary


Compute shaders are coming! How quickly depends on the roll-out and adoption of OpenGL ES 3.1, but there's every chance this technology will find its way into a very wide range of devices as mobile GPUs capable of supporting OpenGL ES 3.1 filter down into the mid-range market over the next couple of years. The same thing happened with the move from OpenGL ES 1.1 to 2.0… nowadays you'd be hard pushed to find a phone or tablet that doesn't support 2.0. Relative ease of use combined with growing ubiquity across multiple platforms could just be a winning combination.

 

See Plout Galatsopoulos' blog on ARM's recent submission for OpenGL ES 3.1 conformance for the Mali-T604, Mali-T628 and Mali-T760 GPUs - and for a great introduction to OpenGL ES 3.1 as a whole, do check out Tom Olson's blog Here comes OpenGL® ES 3.1!

 

 

One more thing…




So that’s it.  But as Columbo would say… just one more thing…

 

OpenGL ES 2.0 Fragment Shaders and Frame Buffer Objects


Although not seen as the power user's weapon of choice, fragment shaders have long been used to run some level of general compute – and they offer one benefit unique amongst all the main approaches here: ubiquity. Any OpenGL ES 2.0-capable GPU – and that really is just about every smart device out there today – can run fragment shaders. This approach involves thinking of texture maps not necessarily as arrays of texels, but simply as 1D or 2D arrays of data. As long as the data to be read and written by the shader can be represented by supported texture formats, these values can be sampled and written out for each element in the array. You just set up a Frame Buffer Object (FBO) and typically render a quad (two triangles making a rectangle) into it, using one or more of these data arrays as texture sources. The fragment shader can then compute more or less whatever it wants from the data in these textures and output the computed result to the FBO. The resulting texture can then be used as a source for any other fragment shaders in the rendering pipeline.
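
As a sketch of the idea, a hypothetical fragment shader like the one below (written against OpenGL ES 2.0, so it runs almost anywhere) treats the bound texture purely as a data array, computing one output value per "element" as the quad is rasterised into the FBO:

    /* Hypothetical "compute" fragment shader: each fragment reads one data
     * element from the input texture and writes a transformed value to the
     * FBO's colour attachment. */
    static const char *kFragSrc =
        "precision mediump float;\n"
        "uniform sampler2D u_data;\n"   /* input array packed as a texture */
        "varying vec2 v_texCoord;\n"    /* element index, from the quad    */
        "void main() {\n"
        "    vec4 d = texture2D(u_data, v_texCoord);\n"
        "    gl_FragColor = vec4(d.rgb * 0.5, 1.0);\n"  /* any per-element op */
        "}\n";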

 

Summary


In this blog I've looked at OpenCL, RenderScript, Compute Shaders and fragment shaders as several options for using the GPU for non-graphical compute workloads. Each approach has characteristics that will suit certain applications or developer requirements, and all of these tools can be leveraged both to improve performance and to reduce energy consumption. It's worth noting that the story doesn't stop here: the world of embedded and mobile heterogeneous computing is evolving fast. The good news is that the Mali GPU architecture is designed to support the latest leading compute APIs, enabling all our customers to achieve improved performance and energy efficiency on a variety of platforms and operating systems.

Mali GPU and Samsung AMOLED combine for picture perfect real-time images


Modern games and applications really push the boundaries of real-time graphics and user interfaces on mobile, and to do this they need all the components of the system to work together to provide the performance those apps need.

 

But it's not all about performance; not only do mobile users demand desktop-equivalent features, they want desktop-equivalent quality too! It's not enough just to push lots of pixels around; they need to be high-quality pixels! Don't get me wrong, better performance allows developers to make use of advanced shader techniques to add high-quality visual special effects, more detailed geometry in their 3D scenes and more animated objects, such as particles for simulating explosions and weather. However, there are things other than performance that can influence the visual quality of your latest apps and games.

 

Last week Samsung announced their new Galaxy Tab S with AMOLED display. This is great for users; its vibrant colours and thin design really help improve the user experience. But great displays need great images to start with! This is where the ARM® Mali™ GPU comes in; it accelerates the rendering of apps and games on your mobile. And, the new Galaxy Tab S just happens to have our current flagship GPU, the Mali-T628 MP6.

 

The Mali-T628 MP6 GPU contains many features that help improve image quality, especially for the real-time 3D graphics used in high-end games. For the techies out there, take ETC2, supported in OpenGL ES 3.0, for example. ETC2 allows the compression of images that contain an opacity (alpha) component, allowing for higher-quality foliage in games. And how about Adaptive Scalable Texture Compression (ASTC), the texture compression format designed by ARM and adopted across the industry? It enables high-quality compression of images with a much wider range of supported formats. Textures in games can now be much higher quality while retaining small file sizes (which you'll know about if you've ever had to wait while your favourite game downloads to your phone!)
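
Once a texture has been compressed offline (for example with ARM's astcenc tool), uploading it is a single call; a hedged sketch, assuming 6x6 blocks (roughly 3.6 bits per texel) and a device exposing the ASTC extension:

    /* Upload an ASTC-compressed image; width/height, data_size and astc_data
     * are assumed to come from the compressed file's header and payload. */
    glCompressedTexImage2D(GL_TEXTURE_2D,
                           0,                                /* mip level         */
                           GL_COMPRESSED_RGBA_ASTC_6x6_KHR,  /* block footprint   */
                           width, height,
                           0,                                /* border: must be 0 */
                           (GLsizei)data_size,
                           astc_data);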

 

Texture compression allows us to save memory bandwidth and further improve visual quality by employing higher resolution textures. Keeping with the theme of bandwidth savings, there’s also ARM’s proprietary Transaction Elimination technology. With no extra effort from the developer (it all happens automatically in the background), bandwidth savings can be made by only updating areas of the screen that have actually changed; again, bandwidth resource that the developer can employ elsewhere to make further improvements to visual quality.

 

Anti-aliasing is another technology that ARM has always employed to improve the visual quality of the images you see in your games and apps. Even at high resolutions aliasing can be an issue, but the Mali range of GPUs can perform anti-aliasing with minimal impact; all developers need to do is turn it on!
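
In practice, "turning it on" usually amounts to asking for multisampled buffers when choosing an EGL config; a minimal sketch, assuming `display` has already been initialised:

    /* Request a 4x multisampled, RGB888, ES 2.0-renderable config. */
    EGLint attribs[] = {
        EGL_RENDERABLE_TYPE, EGL_OPENGL_ES2_BIT,
        EGL_RED_SIZE,   8,
        EGL_GREEN_SIZE, 8,
        EGL_BLUE_SIZE,  8,
        EGL_SAMPLE_BUFFERS, 1,   /* enable multisampling */
        EGL_SAMPLES,        4,   /* 4x MSAA              */
        EGL_NONE
    };
    EGLConfig config;
    EGLint    num_configs;
    eglChooseConfig(display, attribs, &config, 1, &num_configs);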

 

We touched on OpenGL ES 3.0 earlier, and for my last point I'd like to mention it again. With OpenGL ES 3.0 in devices now, developers can make use of higher-dynamic-range formats, both for textures and render targets, meaning source textures and rendered scenes can take advantage of the extra colour information that HDR techniques provide.

 

For a long time now, ARM has driven innovation and visual quality in the graphics industry and the future is no exception. Coming soon we will have ARM Framebuffer Compression (AFBC) and Smart Composition; both technologies help reduce memory bandwidth, allowing developers the freedom to improve those pixels even more!

 

To create compelling picture-perfect visual experiences, developers don’t just throw pixels on the screen, lots of hard work goes into every tiny detail, every leaf on a tree, every curve on a super car and every scar on an action hero’s chin. They all look great on a high quality display but when you combine that with the performance and quality of Mali GPUs that’s when the images really begin to pop!

Huawei Chooses ARM Mali GPUs for its Premium Smartphone Offering


This week saw the exciting launch of the latest ARM® Mali™-T628 MP4 GPU-based product, Huawei's Honor 6. With its HiSilicon Kirin 920 processor based on ARM big.LITTLE processing technology and promising graphics benchmark scores, this latest offering from Huawei is making huge ripples in the Chinese smartphone market.




An impressive array of chips has been coming out of China in recent months. From the Rockchip RK3288 featuring the Mali-T760 through to the MediaTek MT6732, Chinese semiconductor companies are meeting the growing domestic demand for high-performing yet cost-efficient smartphones head on. Over the past two years smartphone shipments in China have nearly quadrupled and the market is starting to mature. Like the majority of smartphone markets, it is now a market of two ends. While 1 billion consumers in China have a phone, only about 40% of these are smartphone owners, largely due to budget constraints. This leaves a potential market of 600 million customers for the OEMs who can deliver a desirable yet cost-effective device. At the other end of the scale, and where the latest release from Huawei falls, are those who desire a premium superphone whose user experience sets the standard for the industry. While competition is strong from international products such as Apple's iPhone 5S and Samsung's Galaxy Note 3, shipments from local suppliers are increasing rapidly as their offerings become increasingly competitive, with the combined benefit of lower prices.

 

HiSilicon's latest chip, the Kirin 920, is one that can comfortably rival these competitors. With a quad-core ARM Cortex®-A15 and quad-core Cortex-A7 in a big.LITTLE processor configuration, it offers both high performance for more intensive workloads and energy efficiency for day-to-day tasks – a truly heterogeneous approach. Combined with the Mali-T628 MP4, the processor is capable of not only driving a stunning 2560x1600 resolution display, 4K video capture and playback and a 13MP camera with HDR support, but also achieving some impressive early results in the AnTuTu benchmark, as highlighted in the launch presentation.

 

[Image: Kirin 920 AnTuTu benchmark results from the launch presentation]

 

On top of it all, it is the first device worldwide to support LTE Cat6, with a maximum, super-fast download speed of 300Mbps.

 

Devices such as the Huawei Honor 6 will accelerate the expansion of the Android gaming and GPU Compute ecosystems in China. The gaming experience on devices with such high levels of processing technology promises to be incredible, and as they reach the hands of more consumers they will encourage developers to create even more graphically challenging, visually impressive applications that enrich the user experience further. The Huawei Honor 6 is yet another proof point that the Chinese marketplace is the one to watch.



Mali Weekly Round-Up: Top Tech and 64-bit Platforms


After a little hiatus while I took a trip to Scotland, these blogs are back on the road! So, what has happened in the ARM Mali world this week?

ARM and Geomerics recognized in the Develop 100 Tech List

 

Develop have published the ultimate list of tech that will influence the future of gaming. Including a multitude of upcoming platforms and the tools, engines and middleware required to make great games run excellently, the list is well worth a read. The ARM-based Raspberry Pi took pride of place in the top spot, followed shortly afterwards by Geomerics' Enlighten technology at #8, which has recently been integrated into Unreal Engine 4 and Unity 5 (#2 and #3 in the Tech List respectively). It was great to see the Mali Developer Center at #33, recognized for its efforts to help developers fully utilize the opportunities the mobile market offers through its broad range of developer tools, supporting documentation and broad collaboration across the gaming industry.

ARM release 64-bit hardware development platform, "Juno"

 

Following the announcement of 64-bit Android at Google I/O, this week Linaro announced a port of the Android Open Source Project (AOSP) to the ARMv8-A architecture. At the same time, the Juno development board was released, sporting a dual-core ARM Cortex-A57 and quad-core Cortex-A53 in big.LITTLE configuration, plus a quad-core ARM Mali-T624 GPU for 3D graphics acceleration and GPU Compute support. Altogether, this provides the ARM ecosystem with a strong foundation on which to accelerate Android availability on 64-bit silicon and drive the next generation of Android mobile experiences.

 

Ask the experts: Jem Davies answers your questions on AnandTech

 

The CPU folk did their "Ask the Experts" a while back and now it's the turn of the GPU to have the spotlight! This week ARM's Jem Davies has been answering your questions on AnandTech, with topics ranging from mobile versus desktop graphics features to GPU Compute and the future of graphics APIs. Jem is also starring in a Google Hangout on AnandTech next week, on Monday 7 July – tune in for what is set to be an informative and detailed debate.

 

Good luck to all teams competing in the Brains Eden Game Jam this weekend!

 

Following our "Developing for Mobile" workshop at Brains Eden, the teams in Cambridge will set to work tomorrow to develop the best game of the weekend. With the importance of mobile gaming growing rapidly, all teams have been given access to a Google Nexus 10 and to ARM experts, who are present throughout the event to guide students in how to get the best performance from these devices. Details of how they get on will be reported next week!

Mali Midgard meets the public and the analysts


It's been a busy time here at Mali-central recently and I have been too busy even to blog about it.

 

I did an Ask The Experts slot with AnandTech recently, and faced some very interesting questions from the public. You might like to take a look.

We also bared our soul and Ryan Smith wrote a very detailed article about the Mali Midgard architecture.

 

Then, finally, I faced Anand Shimpi from AnandTech and did a live interview as a Google Hangout. The whole thing was recorded and put up on YouTube; you can see it here. I know what I meant to say, but what came out of my mouth of course didn't always match that. Oh well...

 

Enjoy...

Realizing the Benefits of GPU Compute for Real Applications with Mali GPUs

I have just returned from a fortnight spent hopping around Asia in support of a series of ARM-hosted events we call the Multimedia Seminars, which took place in Seoul (27th June), Taipei (1st July) and Shenzhen (3rd July). Several hundred attendees joined in each location – a high-quality cross-section of the local semiconductor and consumer industries, including many silicon vendors, OEMs, ODMs and ISVs. All of them were able to hear about the great progress made by the ARM ecosystem partners who are developing the use of GPU Compute on ARM® Mali™ GPUs. In this blog I will try to summarise some of the highlights.

The Benefits of GPU Compute

In my presentation at the three sites I was able to illustrate the benefits of GPU Compute using Mali. This was an easy task, as many independent software vendors at the events were demonstrating and promoting a vast selection of middleware ported to and optimized for Mali.
But what are the benefits of GPU Compute?
  • Reduced power consumption. The architectural characteristics of the Mali-T600 and Mali-T700 series of GPUs enable the computation of many parallel workloads much more efficiently than alternative processor solutions. GPU Compute-accelerated applications can therefore benefit by consuming less energy, which translates into longer battery life.
  • Improved performance and user experience. Where raw performance is the target, the computation of heavy parallel workloads can be significantly accelerated through the use of the GPU. This may translate into an increased frame rate, or the ability to carry out more work within the same time and power budget, and can result in benefits such as improved UI responsiveness, more robust finger detection for gesture UIs in challenging lighting conditions, more accurate physics simulation, and the ability to apply complex pre- and post-processing effects to multimedia on-device and in real time. In essence: a significantly improved end-user experience.
  • Portability, programmability, flexibility. Heterogeneous compute APIs such as OpenCL™ and RenderScript are designed for concurrency. They allow the developer to migrate some of the load from the CPU to the GPU or another accelerator, or to distribute it between processors to enable better load balancing across system resources. For example, a video codec may offload motion vector calculations to the GPU, enabling the CPU to operate with fewer cores and at lower frequencies, or to be available for additional tasks such as video analytics.
  • Reduction of cost, risk and time to market. System designers may be influenced by various cost, flexibility and portability concerns when considering migrating functionality from dedicated hardware accelerators to software solutions which leverage the CPU/GPU subsystem. This approach is made viable and compelling by the additional computational power provided by the GPU, now exposed through industry-standard heterogeneous compute APIs.

Over the last few years ARM has worked very hard to create and develop a strong GPU Compute ecosystem. Collaborations have been established across geographies, use cases and applications, working with partners at all levels of the value chain. These partners have been able to translate the benefits of GPU Compute into reality, to the ultimate benefit of end users, and were proudly showcasing their progress at the Multimedia Seminars.


Demonstrating Reduced Power Consumption

Software codec vendors such as Ittiam Systems have for some time been demonstrating optimized HEVC and VP9 ports that make use of GPU Compute on Mali-T600 series GPUs. A software solution leveraging the CPU+GPU compute subsystem can be useful for reducing time to market and reducing risk in the adoption of new standards, but most importantly, it can help to save power.

For the first time ever, Ittiam Systems publicly demonstrated how a software solution leveraging the CPU+GPU compute subsystem is able to save power compared to a solution that does not make use of the GPU. Using an instrumented development board and power probing tools connected to a National Instruments DAQ unit, they were able to demonstrate a typical reduction in power consumption of over 30% for 1080p30 video playback.

Mukund Srinivasan, Director and General Manager of the Consumer and Mobility Business Unit at Ittiam, said: "These paradigm shifts open a unique window of opportunity for focused media-related Intellectual Property providers like Ittiam Systems® to offer highly differentiated solutions that are not only compute efficient but also enhance user experience by way of a longer battery life, thanks to offloading significant compute to the ARM Mali GPU. The main hurdle to cross, in these innovative solutions, comes in the form of how to manage the enhanced compute demands of a complex codec standard on mobile devices, since it bears significantly more complex coding tools, for codecs like H.265 or VP9, as compared to VP8 or H.264. In order to gain the maximum efficiencies offered by the GPU technology, collaborating with a longstanding partner and a technology pioneer like ARM enabled us to generate original solutions to the complex problems posed in the design and implementation of consumer electronic systems. Working closely with ARM, we have been able to showcase not just a prototype or a demo, but a real working product delivering significant power savings, of the order of 30-35% improvement in energy efficiencies, when measured on the whole subsystem, using the ARM Mali GPU on board a silicon chip designed for use in the Mobile market"

 

 

 

 

Demonstrating reduced CPU Load

Another benefit of GPU Compute was illustrated by our partner Thundersoft, who have implemented a gender-aware, real-time face beautifier application and improved its performance using RenderScript on a Mali-T600 GPU. The algorithm first detects the subject's face and determines its gender, then, based on the gender, applies a chain of complex image processing filters that enhance the subject's appearance. These include face whitening, skin-tone softening and de-blemishing effects.

thundersoft_blog_2.png

The algorithm is very computationally intensive and very taxing on CPU resources, which can at times result in poor responsiveness. Furthermore, performance of 20+ FPS is required to deliver a good user experience, and this is not achievable on the CPU alone. Fortunately, the heavy level of data parallelism and the large proportion of floating-point and SIMD-friendly operations make this use case great for GPU acceleration. Using RenderScript on Mali, Thundersoft were able to improve performance from a poor 10fps to over 20fps, and at the same time reduce the CPU load from fluctuating between 70-100% to a consistent level below 40%. Dynamic power reduction techniques are then able to disable CPUs or scale down their operating points in order to save power.

Delivering improved Performance and User Experience

Image processing is proving to be a very fertile area for GPU Compute. In their keynote and technical speech, ArcSoft illustrated how they have utilised Mali GPUs to improve many of their algorithms, including JPEG decoding, photo filters, beautification, video HDR (NightHawk) and HEVC.
A "nostalgia effect" filter, based on convolution, was optimized using OpenGL® ES. For a 1920x1080 camera preview image, the rendering time was reduced from 80ms down to 20ms using the Mali GPU – going from 12.5fps to 50fps.
Another application that benefited is ArcSoft's implementation of a face beautifier. The camera stream was processed by the CPU, whilst colour conversion and rendering were moved from the CPU to the GPU. Processing time for a 1920x1080 frame was thereby reduced from 30ms to just 10ms. In practice this meant the face beautification frame rate improved from 16fps to 26fps!
Another great example is JPEG processing. OpenCL was used to implement the inverse quantization and IDCT modules. Compared with ArcSoft's NEON-based JPEG decoder, performance in decoding 4000x3000 images improved by 25%. Compared with the OpenCL-based open-source project JPEG-OpenCL, the efficiency of the IDCT increased as much as 15 times.

Improved User Experience for Computer Vision Applications

You may have previously seen our partner eyeSight Technologies demonstrate how they have been able to improve the robustness and reliability of their gesture UI engine. Gesture UIs are particularly challenged when lighting conditions are poor, as this adds a lot of noise to the sensor data and reduces the accuracy of gesture detection. As it happens, poorly lit situations are common where gesture UIs are typically used, such as inside a car or in a living room. GPU Compute significantly increases the amount of useful computation that the gesture engine can carry out on the image within the same time and energy budget, which enables significantly more reliable gesture recognition when lighting is poor.

eyeSight's machine vision algorithms make extensive use of machine learning (using neural networks). The capability to learn from a vast amount of data in a reasonable amount of time is a key element for success. However, the computational resources required by neural networks are beyond the capabilities of standard CPUs, so eyeSight's use of deep learning methods can greatly benefit from running on GPUs.
eyeSight have used their extensive knowledge of machine vision technologies and the ARM Mali GPU Compute architecture to optimize their solutions using OpenCL on Mali.

Alva demonstrate video HDR and real-time video stabilization

In his presentation, Oscar Xiao, CEO of Alva Systems, discussed the value of heterogeneous computing for camera-based applications, using two examples: real-time video stabilization and HDR photography. Alva have optimized their solutions for the Mali-400, Mali-450 and Mali-T628, with implementations of their algorithms available using the OpenGL ES and OpenCL APIs. Through the use of the GPU, image stabilization can be carried out comfortably for 1080p video streams at 30fps and above. Alva Systems have also implemented an advanced HDR solution that corrects image distortion (common in multi-frame processing algorithms), removes ghosting and carries out intelligent tone mapping (to enable a more realistic result). Of course, all of these features increase the computational requirements of the algorithm; GPU Compute enables real-time computation. Alva measured performance improvements of around 13-15x for individual blocks compared to the reference CPU implementation of the same algorithm.

In Conclusion


Modern compute APIs enable efficient and portable heterogeneous computing. This includes using the best processor for the task, balancing workloads across system resources and offloading heavy parallel computation to the GPU. GPU Compute with ARM Mali brings tangible advantages to real-world applications, including reduced cost and time to market, improved performance and user experience, and improved energy efficiency (measured on consumer devices). These benefits are being enabled by our ecosystem partners, who use GPU Compute on Mali for a variety of applications including advanced imaging, computer vision, computational photography and media codecs.

 

[Image: the ARM Mali GPU Compute partner ecosystem]

Industry leaders take advantage of the capabilities of ARM Mali GPUs to innovate and deliver - be one of them!

Mali Weekly Round-Up: Geomerics Set To Enter The Film World and Mali-450 Momentum Continues

$
0
0

UK Govt Backs Geomerics to Revolutionize the Movie Industry

 

Earlier this week, Geomerics announced that it has won a £1 million award from the UK's Technology Strategy Board (TSB) to bring its real-time graphics rendering techniques from the gaming world to the big screen. Geomerics and its partners will help the film and television services industry become more efficient by decreasing the amount of time spent on rendering, particularly for lighting, one of the most time-consuming parts of the editing process. Traditionally, all editing was done offline and then rendered to bring it to full quality, with the rendering taking 8-12 hours. However, the gaming world has developed techniques that allow full-quality graphics sequences to be rendered instantly – and Geomerics is looking to bring them to the film world.

 

For more information on the technology behind the announcement, check out the Geomerics website.

 

Hardkernel Release the Odroid-XU3 Development Board

 

Based on Samsung's Exynos 5422 SoC with its ARM Mali-T628 MP6 GPU, this new development board from Hardkernel offers a heterogeneous multiprocessing solution with great 3D graphics. Thanks to its open-source support, the board can run various flavours of Linux, including the latest Ubuntu 14.04, as well as Android 4.4.


Full details of the board are available on the Hardkernel website, and it also received a great write-up on Linux Gizmos.

 

Today Was Clearly The Day of the Mali-450 MP GPU

 

Four devices featuring the Mali-450 were announced today, two with the MediaTek MT6592 at their heart and two with a HiSilicon SoC. The Mali-450 has been picking up momentum over the past six months and now we are starting to see it in a range of smartphones, such as the HTC Desire 616 and the Wickedleak Wammy Neo Youth, as well as tablets such as the HP Slate 7 VoiceTab Ultra and HP Slate 8 Plus.

 

Aricent on Architecting Video Software for Multi-core Heterogeneous Platforms


If you haven't caught it already, our GPU Compute partner Aricent has posted a great blog in their section of the community: Parallel Computing: Architecting video software for multi-core heterogeneous platforms. It covers conventional techniques used by software designers to parallelize their code, then proposes a novel "hybrid and massive parallelism based multithreading" model as a potential way to overcome the shortcomings of spatial and functional splitting. It's definitely worth a read if you're interested in programming for multi-core platforms.
