ARM Mali Graphics

ARM enables smartphone-quality visuals on wearable and IoT devices with the Mali-470 graphics processor


Yesterday, the Media Processing Group at ARM announced a new highly-efficient graphics processing unit (GPU), the ARM® Mali™-470, to enable smartphone-quality visuals on wearable and IoT devices:

 

ARM Mali-470 GPU Offers Improved Efficiency and Experiences on Wearable and IoT Devices

 

A growing market with unique challenges

 

The wearables market has been growing steadily for many years with more and more devices and applications entering the market. ARM has long been associated with wearables, with many devices based on ARM technologies and more recently with the “Wearables for good” challenge in partnership with UNICEF and frog. That association now extends to graphics processing with the Mali-470.

 

Mali-470 is the latest in the Mali-400 series of graphics processors that run applications using the ubiquitous OpenGL® ES 2.0 graphics standard. The Mali-400 family has shipped in more than a billion devices worldwide and is favoured where efficient graphics processing is a must. One example is the growing number of Systems-on-Chip (SoCs) designed specifically for wearable and IoT applications, such as MediaTek’s MT2601 SoC, announced earlier this year in support of Google’s Android Wear software:

 

MediaTek Introduces MT2601 in Support of Google’s Android Wear Software

 

The key advantage of Mali-470 is that it consumes half the power of the Mali-400 GPU, helping device manufacturers bring the smartphone user experience to environments with even greater power constraints.

 

Expanding the smartphone user experience



 

We’ve all become accustomed to high-quality visuals, backed by a touchscreen, as the most intuitive way of interacting with our smartphones and tablets. When we use other types of device, we want to interact with them in a very similar way.

 

For those of us who remember Video Cassette Recorders, and the frustration of trying to program the timer for the first time, it’s hard to imagine anyone tolerating that kind of user experience ever again. Yet, across many devices, the user interface quality has fallen far behind that of our smartphone.

 

From watches to thermostats, industrial control panels in factories and warehouses, multi-function printers in offices, infotainment systems in cars and home appliances, highly efficient graphics processing is essential to render intuitive user interfaces.

 

The challenge many of these devices face is power consumption and how to reduce it as the interface becomes more sophisticated – we think Mali-470 is the answer.

 

Why OpenGL ES 2.0?

 

Every pixel matters in delivering high-quality user interfaces. This is especially true for smaller screens, where every pixel must play a role in conveying information clearly, providing intuitive controls, or both.

 

The majority of Android™, Android Wear and other emerging operating systems, such as Tizen™, use OpenGL ES 2.0 for modern user interfaces, mapping, casual gaming, etc. OpenGL ES 2.0 offers the ideal balance between per-pixel control with programmable shaders and energy-efficiency. Mali-470 uses the same industry-standard OpenGL ES 2.0 driver stack as the Mali-400 GPU so there is no need to re-optimise existing applications – anything written for Mali-400 will work seamlessly on the Mali-470 GPU.

 

More recent versions of OpenGL ES have introduced a number of additional features to support immersive video games; however, the OpenGL ES 2.0 feature level is the most efficient for the user interfaces that appear on wearable and IoT devices.

 


Half the power consumption

 

Building on the success of the Mali-400 GPU, Mali-470 delivers the same rich performance at the same process geometry while halving the power consumption. This provides SoC manufacturers with scalable options to enable them to create embedded graphics subsystems that meet the needs of new low-power devices.

 


 

Mali-470 achieves this by building on the energy-efficiency gained in Mali-450 and applying focussed design changes to the Vertex and Fragment Processors. This results in half the power consumption with the same performance when compared to the Mali-400. Vertex processors construct the “wire frame” of a scene and the fragment processors perform the per-pixel shading, colours and effects such as transparency. For wearable device resolutions a single fragment processor is sufficient, but Mali-470 has the ability to scale to four fragment processors to support the higher resolutions of devices with larger screens.

 

Mali-470 block diagram: up to 4 pixel processors can be implemented, and this multi-core design supports screen resolutions from 640x640 to 1080p at 60FPS, 32bpp

 

The design improvements in Mali-470 can be grouped into three areas of equal importance: Quad-thread scheduling, Microarchitectural and Datapath optimisations.

 

Quad-thread scheduling optimisations:

  • Enforcing the grouping of quads (2x2 pixel threads) so that the frequency of control and state updates within the pipelines is significantly reduced.
  • Optimising many of the functional blocks to operate on quads.
  • Centralising a subset of per-quad state and accessing it only when necessary, rather than clocking it through the pipelines.

Microarchitectural optimisations:

  • Making aggressive use of clock-gating throughout the design, including clock-gating of all function-orientated L1 caches
  • Bypassing functional blocks whenever instruction execution can proceed without them

Datapath optimisations:

  • Optimising datapaths to make targeted use of fixed-point arithmetic, rather than floating-point arithmetic for vertex processing

 

Wearables and beyond…

 

Designed for wearables and IoT devices, the Mali-470 GPU will benefit a multitude of devices that require a rich UI and where energy-efficiency is important, especially when coupled with ARM CPUs such as the Cortex®-A7 and A53 processors. You can see some of the possibilities below:

 


 

To summarise, the Mali-470 graphics processor further expands the smartphone experience into a wider range of devices including wearables, home gateways and appliances, industrial control panels, healthcare monitors and even new entry-level smartphones.

 

With half the power consumption of the billion-selling Mali-400 GPU, Mali-470 opens the door for more vibrant user interfaces and provides exciting opportunities for designers to innovate with graphics in even more power-constrained environments. We expect to see Mali-470 appearing in its first devices from early 2017.


Improving ARM® Mali™ drivers on fbdev


This blog post refers to the public ARM Mali Midgard r6p0 user-space binary drivers for GNU/Linux, which are now available for download. The "fbdev" variant now has support for dma-buf using a dedicated ioctl, which is explained in detail here.

 

What is wrong with fbdev?

 

The Linux kernel defines a user-space API that lets applications control displays via frame buffer drivers, also known as "fbdev" drivers.  This is achieved via a set of file operations on "/dev/fb*" devices such as ioctl or mmap, which allows direct access to the pixel data of a display.  While this is a simple, widespread, lightweight and powerful interface, it is no longer suitable in the world of modern embedded graphics computing.

 

For example, it does not provide any means to share the frame buffer directly with another kernel driver such as a GPU driver.  When running an application that renders graphics with a GPU, each frame contains pixels that the GPU generated within its own buffers.  The display controller is a separate entity with its own frame buffer located elsewhere in memory.  So the standard way of displaying the GPU frames in fbdev mode is for the user-space (EGL part of the driver) to copy the data from a GPU buffer to the display controller's frame buffer.  This is done on the CPU, typically by calling memcpy.  This works, but can easily become the biggest bottleneck in the system and is therefore not acceptable in a real product.
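To make the copy concrete, here is a minimal sketch (in C, with error handling omitted and the function name purely illustrative) of what this conventional fbdev present path amounts to: query the frame buffer layout, map it, and memcpy each finished GPU frame into it on the CPU.

    /* Minimal sketch of the conventional fbdev present path: the CPU copies
     * each finished GPU frame into the display controller's frame buffer.
     * Error handling is omitted for brevity; present_frame_with_memcpy() is
     * an illustrative name, not part of any driver API. */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <linux/fb.h>

    static void present_frame_with_memcpy(const void *gpu_pixels, size_t frame_bytes)
    {
        int fd = open("/dev/fb0", O_RDWR);

        struct fb_fix_screeninfo fix;
        ioctl(fd, FBIOGET_FSCREENINFO, &fix);   /* line length, buffer size */

        /* Map the display controller's frame buffer into user space ... */
        void *fb = mmap(NULL, fix.smem_len, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);

        /* ... and copy the GPU output into it on the CPU. This copy is the
         * bottleneck the rest of this post sets out to remove. */
        memcpy(fb, gpu_pixels, frame_bytes);

        munmap(fb, fix.smem_len);
        close(fd);
    }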

 

There are many other limitations to "fbdev" which are not covered in this blog post, such as managing layers, synchronising applications with the display controller or performing 2D composition in hardware.

 

What can we do to fix it?

 

All new display controller drivers should implement the Direct Rendering Manager (DRM) API instead of fbdev.  DRM offers a much more modern set of features and solves many of the problems seen with fbdev.  However, there are still a lot of GPU-accelerated products that control their display using an fbdev Linux driver.  Work needs to be done on each of these to avoid the unacceptably inefficient CPU memcpy and to let the GPU write directly into the frame buffer.

 

A typical way of achieving this is by using the dma-buf framework in the Linux kernel.  Typically, the frame buffer driver registers itself as a dma-buf exporter and implements a way for user-space to get a file descriptor for this frame buffer (an ioctl).  The user-space driver then passes the dma-buf file descriptor to the GPU kernel driver to import and use.  When rendering happens, the GPU pixels are directly written into the frame buffer using a hardware DMA operation - this is also referred to as "zero-copy" and is much faster than calling memcpy.  However, there is no standard ioctl in the fbdev API to export the frame buffer with dma-buf so each fbdev driver has a slightly different one.  This means the user-space needs to be modified to work with each fbdev driver, which is not compatible with the public standard Mali binary drivers as they must not depend on platform specific display drivers.

 

FBIOGET_DMABUF

 

This is why, starting with r6p0, we are adding a more generic ioctl to export the dma-buf file descriptor, defined as a custom extension to the standard fbdev API in supported kernels: the FBIOGET_DMABUF ioctl.  This way, no extra dependency is added to the user-space binary.  If the ioctl is not available on a given kernel, then the user-space will revert to memcpy.  We have already enabled this in our ODROID-XU3 Linux kernel branch on GitHub.
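As an illustration, user space can probe for the new ioctl and fall back to the copy path when it is absent. The structure layout and ioctl number below mirror the custom extension as found in kernels such as the ODROID-XU3 branch; they are shown here as an assumption and should be confirmed against the headers of the kernel you are actually targeting.

    /* Hedged sketch: try FBIOGET_DMABUF, fall back to memcpy if unsupported.
     * The struct layout and ioctl number are assumed to match the custom
     * fbdev extension in the supported kernels; check your kernel headers. */
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/fb.h>

    #ifndef FBIOGET_DMABUF
    struct fb_dmabuf_export {
        uint32_t fd;     /* returned dma-buf file descriptor */
        uint32_t flags;  /* currently unused */
    };
    #define FBIOGET_DMABUF _IOR('F', 0x21, struct fb_dmabuf_export)
    #endif

    /* Returns a dma-buf fd for the frame buffer, or -1 if the kernel does not
     * implement the extension and the caller must revert to the memcpy path. */
    static int get_framebuffer_dmabuf(int fb_fd)
    {
        struct fb_dmabuf_export req = { 0, 0 };

        if (ioctl(fb_fd, FBIOGET_DMABUF, &req) < 0)
            return -1;      /* e.g. ENOTTY: ioctl not supported */

        return (int)req.fd; /* pass this to the GPU kernel driver to import */
    }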

 

We should keep this added functionality in all our new fbdev Mali Midgard binary drivers and Linux kernels for supported platforms, such as ODROID-XU3, Chromebook devices with an ARM Mali GPU, Firefly...

Eats, Shoots and Interleaves


Vertex interleaving

 

Recently we were asked via the community whether there was an advantage in either interleaving or not interleaving vertex attributes in buffers. For the uninitiated, vertex interleaving is a way of mixing all the vertex attributes into a single buffer. So if you had 3 attributes (let’s call them Position (vec4), Normal (vec4), and TextureCoord (vec2)) uploaded separately they would look like this:

 

P1xP1yP1zP1w , P2xP2yP2zP2w , P3xP3yP3zP3w ... and so on

N1xN1yN1zN1w , N2xN2yN2zN2w , N3xN3yN3zN3w ... and so on

T1xT1y , T2xT2y , T3xT3y ... and so on

 

(In this case the commas denote a single vertex worth of data)

The interleaved buffer would look like this:

 

P1xP1yP1zP1w N1xN1yN1zN1w T1xT1y , P2xP2yP2zP2w N2xN2yN2zN2w T2xT2y ,

P3xP3yP3zP3w N3xN3yN3zN3w T3xT3y ... and so on

 

(Note the colours for clarity)

 

… Such that the individual attributes are mixed, with a given block containing all the information for a single vertex. This technique is what the stride argument in the glVertexAttribPointer function and its variants is for, allowing the application to tell the hardware how many bytes it has to jump forwards to get to the same element in the next vertex.
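As a quick illustration of how the two layouts are described to the driver, the OpenGL ES 2.0 calls below set up the separate-buffer case and the interleaved case; the attribute locations and buffer objects are assumed to have been created elsewhere.

    /* Sketch of describing both layouts with glVertexAttribPointer.
     * POS_LOC, NRM_LOC and TEX_LOC are assumed attribute locations. */
    #include <GLES2/gl2.h>

    /* Separate, tightly packed buffers: a stride of 0 means "no gap". */
    static void setup_separate(GLuint pos_vbo, GLuint nrm_vbo, GLuint tex_vbo,
                               GLint POS_LOC, GLint NRM_LOC, GLint TEX_LOC)
    {
        glBindBuffer(GL_ARRAY_BUFFER, pos_vbo);
        glVertexAttribPointer(POS_LOC, 4, GL_FLOAT, GL_FALSE, 0, (void *)0);
        glBindBuffer(GL_ARRAY_BUFFER, nrm_vbo);
        glVertexAttribPointer(NRM_LOC, 4, GL_FLOAT, GL_FALSE, 0, (void *)0);
        glBindBuffer(GL_ARRAY_BUFFER, tex_vbo);
        glVertexAttribPointer(TEX_LOC, 2, GL_FLOAT, GL_FALSE, 0, (void *)0);
    }

    /* One interleaved buffer: each vertex is {vec4 P, vec4 N, vec2 T},
     * i.e. 10 floats, so the stride is 10 * sizeof(GLfloat) = 40 bytes. */
    static void setup_interleaved(GLuint vbo,
                                  GLint POS_LOC, GLint NRM_LOC, GLint TEX_LOC)
    {
        const GLsizei stride = 10 * sizeof(GLfloat);
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glVertexAttribPointer(POS_LOC, 4, GL_FLOAT, GL_FALSE, stride, (void *)0);
        glVertexAttribPointer(NRM_LOC, 4, GL_FLOAT, GL_FALSE, stride,
                              (void *)(4 * sizeof(GLfloat)));
        glVertexAttribPointer(TEX_LOC, 2, GL_FLOAT, GL_FALSE, stride,
                              (void *)(8 * sizeof(GLfloat)));
    }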

 

However, even though we all knew about interleaving, none of us could really say whether it was any better or worse than just putting each attribute in a different buffer, because (to put it bluntly) separate buffers are just easier to implement.

 

So in a twist to usual proceedings I have conferred with arguably the top expert in efficiency on Mali,  Peter Harris. What follows is my interpretation of the arcane runes he laid out before my quivering neurons:

 

Interleaving is better for cache efficiency…

 

… Sometimes.

 

Why does interleaving work at all?

 

The general idea behind interleaving is related to cache efficiency. Whenever data is pulled from main memory it is loaded as part of a cache line. This single segment of memory will almost certainly contain more than just the information desired, as one cache line is larger than any single data type in a shader program. Once in the local cache the data in the loaded line is more quickly available for subsequent memory reads. If this cache line only contains one piece of required information, then the next data you need is in a different cache line which will have to be brought in from main memory. If however, the next piece of data needed is in the same cache line, the code can fetch directly from the cache and so performs fewer loads from main memory and therefore executes faster.

 

Without getting into physical memory sizes and individual components, this can be illustrated thusly:

 

Imagine we have 3 attributes, each of them vec4s. Individually they look like this:

 

| P1 P2 | P3 P4 | ...

| N1 N2 | N3 N4 | ...

| T1 T2 | T3 T4 | ...

 

From this point forward those vertical lines represent the boundaries between cache lines. For the sake of argument, the cache lines in this example are 8 elements long, so contain 2 vec4s; but in the real world our cache lines are 64 bytes, large enough to hold four 32-bit precision vec4 attributes. For the sake of clear illustration I’ll be keeping the data small in these examples, so if we want all the data for vertex number 2 we would load 3 cache lines from the non-interleaved data:

 

P1 P2

N1 N2

T1 T2

 

If this data is interleaved like so:

 

| P1 N1 | T1 P2 | N2 T2 | P3 N3 | T3 P4 | N4 T4 | ...

 

The cache lines fetched from main memory will contain:

 

T1 P2

N2 T2

 

(We start from T1 because of the cache line alignment)

 

Using interleaving we've performed one less cache line fetch. In terms of wasted bandwidth, the non-interleaved case loaded 3 attributes which went unused, but only one unused attribute was fetched in the interleaved case. Additionally, it's quite possible that the T1 P2 cache line wouldn't need to be specifically fetched while processing vertex 2 at all; if the previously processed vertex was vertex 1, it is likely that the data will still be in the cache when we process vertex 2.

 

Beware misalignment

 

Cache efficiency can be reduced if the variables cross cache line boundaries. Notice that in this very simple example I said the texture coordinates were vec4s. Ordinarily texture coordinates would be held in vec2 format, as shown in the very first explanation of interleaving. In this case, visualising the individual elements of the buffer, the cache boundaries would divide the data in a very nasty way:

 

PxPyPzPw NxNyNzNw | TxTy PxPyPzPw NxNy | NzNw TxTy PxPyPzPw | …

 

Notice that our second vertex's normal is split, with the x,y and z,w in different cache lines. Though two cache lines will still contain all the required data, it should be avoided as there is a tiny additional power overhead in reconstructing the attribute from two cache lines. If possible it is recommended to avoid splitting a single vector over two cache lines (spanning a 64-byte cache boundary), which can usually be achieved by suitable arrangement of attributes in the packed buffer. In some cases adding padding data may help alignment, but padding itself creates some inefficiencies as it introduces redundant data into the cache which isn’t actually useful. If in doubt try it and measure the impact.
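As a sketch of what this means in practice, the two C structs below describe the packed 40-byte vertex (where, as in the example above, the second vertex's normal straddles a 64-byte line) and a version padded to 48 bytes so that every vector sits at a 16-byte-aligned offset and can never cross a cache line. The type names are illustrative only.

    /* Illustrative only: a packed 40-byte vertex versus one padded to 48
     * bytes. With a 48-byte stride every vec4 starts on a 16-byte boundary,
     * so no vector ever straddles a 64-byte cache line; the cost is 8 bytes
     * of redundant data per vertex, so measure before committing to it. */
    #include <assert.h>

    typedef struct {
        float position[4];   /* bytes  0-15 within the vertex */
        float normal[4];     /* bytes 16-31 */
        float tex_coord[2];  /* bytes 32-39 */
    } PackedVertex;          /* 40-byte stride: the second vertex's normal
                                spans bytes 56-71 and crosses a cache line */

    typedef struct {
        float position[4];
        float normal[4];
        float tex_coord[2];
        float padding[2];    /* dead data, but keeps every offset 16-aligned */
    } PaddedVertex;          /* 48-byte stride: no vector is ever split */

    static_assert(sizeof(PaddedVertex) == 48, "stride should be 48 bytes");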

 

But it's not always this simple

 

If we look at the function of the GPU naïvely, all of the above makes sense; however, the GPU is a little cleverer than that. Not all attributes need to be loaded by the vertex processor. The average vertex shader looks something like this:

 

     uniform vec4 lightSource;
     uniform mat4 modelTransform;
     uniform mat4 cameraTransform;

     in vec4 position;
     in vec4 normal;
     in vec2 textureCoord;
     in vec2 lightMapCoord;

     out float diffuse;
     out vec2 texCo;
     out vec2 liCo;

     void main( void ){
          texCo = textureCoord;
          liCo = lightMapCoord;
          diffuse = dot((modelTransform*normal),lightSource);
          gl_Position=cameraTransform*(modelTransform*position);
     }

 

If you look at our outputs, diffuse is calculated at this stage, as is gl_Position, but texCo and liCo are just read from the input and passed straight back out without any computation performed. For a deferred rendering architecture this is really a waste of bandwidth as it doesn’t add any value to the data being touched. In Midgard family GPUs (Mali-T600 or higher) the driver understands this (very common) use case and has a special pathway for it. Rather than load these attributes in the GPU and output them again to be interpolated, the vertex processor never really sees attributes of this type. They can bypass the vertex shader completely and are just passed directly to the fragment shader for interpolation.

 

Here I've used a second set of texture coordinates to make the cache align nicely for this example. If we fully interleave all of the attributes, our cache structure looks like this:

 

PxPyPzPw NxNyNzNw | TxTy LxLy PxPyPzPw | NxNyNzNw TxTy LxLy | ...

 

Here the vertex processor still needs to load in two attributes P and N, for which the cache line loads will either look like:

 

PxPyPzPw NxNyNzNw

 

… or …

 

TxTy LxLy PxPyPzPw | NxNyNzNw TxTy LxLy

 

… to obtain the required data, depending on which vertex we are loading. In this latter case the T and L components are never used, and will be loaded again separately to feed into the interpolator during fragment shading. It’s best to avoid the redundant data bandwidth of the T and L loads for the vertex shading and the redundant loads of P and N when fragment shading. To do this we can interleave the data into separate buffers, one which contains all of the attributes needed for computation in the vertex shader:

 

PxPyPzPw NxNyNzNw | PxPyPzPw NxNyNzNw | PxPyPzPw NxNyNzNw | ...

 

… and one containing all of the attributes which are just passed directly to interpolation:

 

TxTy LxLy TxTy LxLy | TxTy LxLy TxTy LxLy | TxTy LxLy TxTy LxLy | ...

 

This means that the vertex shader will only ever need to touch the red and green cache lines, and the fragment interpolator will only ever have to touch the blue and orange ones (as well as any other interpolated outputs from the vertex shader). This gives us a much more efficient bandwidth profile for the geometry processing. In this particular case it also means perfect cache alignment for our vertex processor.
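A hedged sketch of how this split might be described to OpenGL ES 2.0 follows; the attribute locations match the example shader above but are otherwise assumptions, as are the buffer object names.

    /* Sketch of the two-buffer split: one interleaved buffer for attributes
     * the vertex shader computes with (position, normal) and one for the
     * attributes that are only passed through to interpolation. */
    #include <GLES2/gl2.h>

    static void setup_split_buffers(GLuint compute_vbo, GLuint passthrough_vbo,
                                    GLint POS_LOC, GLint NRM_LOC,
                                    GLint TEX_LOC, GLint LMAP_LOC)
    {
        /* Buffer 1: P and N interleaved, 32 bytes per vertex. */
        glBindBuffer(GL_ARRAY_BUFFER, compute_vbo);
        glVertexAttribPointer(POS_LOC, 4, GL_FLOAT, GL_FALSE,
                              8 * sizeof(GLfloat), (void *)0);
        glVertexAttribPointer(NRM_LOC, 4, GL_FLOAT, GL_FALSE,
                              8 * sizeof(GLfloat), (void *)(4 * sizeof(GLfloat)));

        /* Buffer 2: T and L interleaved, 16 bytes per vertex. The vertex
         * shading pass never needs to touch these cache lines. */
        glBindBuffer(GL_ARRAY_BUFFER, passthrough_vbo);
        glVertexAttribPointer(TEX_LOC, 2, GL_FLOAT, GL_FALSE,
                              4 * sizeof(GLfloat), (void *)0);
        glVertexAttribPointer(LMAP_LOC, 2, GL_FLOAT, GL_FALSE,
                              4 * sizeof(GLfloat), (void *)(2 * sizeof(GLfloat)));
    }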

 

A note on data locality

 

Caches function best when programs make use of the data in the same cache lines in a small time window. This maximizes the chance that data we have fetched is still in the cache and avoids a refetch from main memory. Cache lines often contain data from multiple vertices which may come from multiple triangles. It is therefore best practice to make sure that these adjacent vertices in memory are also nearby in the 3D model (both in terms of attribute buffers and index buffers). This is called data locality and you normally need look no further than your draw call's indices (if you are not using indexed models you have far bigger problems than cache efficiency to solve). If the indices look like this:

 

(1, 2, 3) (2, 3, 4) (3, 4, 5) (4, 5, 2) (1, 3, 5) ...

 

You have good data locality. On the other hand, if they look like this:

 

(1, 45, 183) (97, 12, 56) (4, 342, 71) (18, 85, 22) ...

 

… then they're all over the place and you’ll be making your GPU caches work overtime. Most modelling software will have some kind of plugin to better condition vertex ordering, so talk to your technical artists to get that sorted somewhere in the asset production process.

 

To maximize the cache efficiency it’s also worth reviewing the efficiency of your vertex shader variable types, both in terms of sizes and number of elements. We see a surprising amount of content which declares vector elements and then leaves many channels unused (but allocated in memory and so using valuable cache space); or which uploads highp fp32 data and then uses it in the shader as a mediump fp16 value. Removing unused vector elements and converting to narrower data types (provided the OES_vertex_half_float extension is available) is a simple and effective way to maximize cache efficiency, reduce bandwidth, and improve geometry processing performance.
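For instance, once the extension is confirmed at runtime, the texture coordinate attribute can be declared as 16-bit half floats rather than 32-bit floats; the snippet below is a sketch with an assumed attribute location and a fallback path.

    /* Sketch: use 16-bit half-float texture coordinates when
     * GL_OES_vertex_half_float is available, halving their cache footprint. */
    #include <string.h>
    #include <GLES2/gl2.h>
    #include <GLES2/gl2ext.h>

    static void set_texcoord_pointer(GLint TEX_LOC, GLsizei stride_bytes,
                                     const void *offset)
    {
        const char *ext = (const char *)glGetString(GL_EXTENSIONS);

        if (ext && strstr(ext, "GL_OES_vertex_half_float"))
            glVertexAttribPointer(TEX_LOC, 2, GL_HALF_FLOAT_OES, GL_FALSE,
                                  stride_bytes, offset);
        else  /* fall back to fp32 data if the extension is missing */
            glVertexAttribPointer(TEX_LOC, 2, GL_FLOAT, GL_FALSE,
                                  stride_bytes, offset);
    }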

 

So there you have it, interleaving vertex attributes. It would be remiss of me to tell you to expect immediate vast performance improvements from this technique. At best this will only claw back a tiny bit of efficiency, but in large complex projects where you need to squeeze as much as possible out of the hardware, these tiny improvements can all add up.

 

Thanks again to Peter Harris, who provided a lot of the information for this blog and also was kind enough to go through it afterwards and take out all my mistakes.

The "ARM Game Developer Day" is back by popular demand!



 

Be part of the future of mobile game technology. Bring along your project challenges for expert advice on how to maximise performance of your game for mobile platforms. Learn about the latest ARM CPU and Mali GPU architecture, multicore programming and tile-based mobile GPUs and see how to implement highly optimized rendering effects for mobile.

 

The day will feature:

  • VR showcase area with live demos
  • Q&A Clinics and the sharing of best coding practice for mobile platforms
  • Talks on mobile VR and Geomerics’ advanced applications of dynamic global illumination, plus a panel discussion on “the future of mobile game technology”
  • Open discussions with ARM experts about your own project challenges and suggested improvements
  • The opportunity to network with top mobile game developers and engineers
  • Free hot lunch and refreshments throughout the day

 

London - ARM Game Developer Day

 

Thursday 3rd December 2015

9:00am - 6:00pm

Rich Mix, 35-47 Bethnal Green Rd, London E1 6LA

 

See full agenda and register here to make sure you don’t miss out!

 

 

Chengdu - ARM Game Developer Day

 

Wednesday 16th December 2015

9:00am - 6:00pm

iTOWN Coffee, Building A, Chengdu Tianfu Software Park, China

 

See full agenda and register here too!

Virtual Reality: The Ice Cave


This blog was written by Kapileshwar Syamasundar during his summer placement at ARM in the ARM Mali Graphics demo team. Kapil did some great work at ARM porting the Ice Cave demo to VR using Unity, we hope you can benefit from this too.

 

 

Ice Cave VR

 

1. Introduction

Ice Cave, the latest demo from the ARM Mali Ecosystem, has been shown with great success this year at major events such as GDC, Unite Europe and Unite Boston. The demo has been developed in Unity and aims to demonstrate that it is possible to render high visual quality content on current mobile devices. A number of highly optimized special effects were developed in-house, specifically for this demo, some of which are based on completely new techniques, for example the rendering of shadows and refractions based on local cubemaps.


Figure 1. View of the cave from the entrance in the Ice Cave demo.

 

The Ice Cave demo was released at a time when Virtual Reality has become the centre of attention in the game development community, and related events and media. A number of VR demos and games have already been released but VR performance requirements can limit the complexity of VR content and therefore the visual quality of the final VR experience.

It is in this landscape that the Ecosystem demo team decided to port the Ice Cave demo to Samsung Gear VR and this task was assigned to me. In this blog I describe my experience in porting the Ice Cave demo to VR during my eight weeks summer placement in the Ecosystem demo team.

By the time I joined the demo team, Unity had just released a version with VR native support for Oculus Rift and Samsung Gear VR.  Previously, VR support was only available by means of a plugin based on Oculus Mobile SDK, but this had some obvious limitations:

  • Each VR device has a different plugin
  • Plugins may conflict with each other
  • Release of newer VR SDKs / Runtimes can break older games
  • Lower-level engine optimizations are not possible with the plugin approach of two separate cameras

Conversely, the newly released Unity VR native integration lacked both support and sufficient information for developers, and experienced many unresolved issues. Nonetheless, the team was convinced that with the native integration in Unity we would be able to achieve the best possible performance; a key point in guaranteeing a successful VR user experience.

 

2. Samsung Gear VR

The Samsung Gear VR headset does not have a built-in display but has instead been designed to host a mobile phone. At the time of writing, the Samsung Gear VR comes in two versions: one for the Samsung Note 4 and another for the latest Samsung Galaxy S6. Some of the main specifications of the Samsung Galaxy S6 version are listed below.

 

  • Sensors: Accelerometer, Gyroscope, Geomagnetic, Proximity
  • Motion-to-photon latency < 20ms
  • Manual focal adjustment
  • Main physical UI: touch pad
  • Oculus’s Asynchronous TimeWarp technology

Figure 2. The Samsung Gear VR for Samsung Galaxy S6.

Samsung Gear VR is powered by Oculus VR software and incorporates the Oculus Asynchronous TimeWarp technology. This important feature helps reduce latency (the time taken to update the display based on the latest head movement), a key issue to avoid in VR devices. Besides the TimeWarp technology, the Samsung Gear VR has several sensors which it uses in place of the ones incorporated in the phone.

The Samsung Gear VR has its own hardware and features a touch pad, back button, volume key and, according to the specifications, an internal fan designed to help demist the device while in use.

The key point here however, is that you can insert your Samsung Galaxy S6 into the headset and enjoy an immersive experience with just a smartphone. We are no longer limited to the screen size of the phone and can instead become completely immersed in a virtual world.

 

3. Main steps to port an app/game to VR in Unity

VR integration in Unity has been achieved following one of the main Unity principles, that it must be simple and easy. The following basic steps are all that are needed to port a game to VR:

  • Use Unity 5.1 (or any higher version) with native VR support.
  • Obtain the signature file for your device from the Oculus website and place it in the Plugins/Android/assets folder.
  • Set the “Virtual Reality Supported” option in Player Settings.
  • Set a parent to the camera. Any camera control must set the position and orientation on the camera parent.
  • Associate the camera control with the Gear VR headset touch pad.
  • Build your application, deploy it on the device and launch it.
  • You will be prompted to insert the device into the headset. If the device is not ready for VR you will be prompted to connect to the network so the device can download the Samsung VR software.
  • NB. It is useful to set the phone to developer mode to visualize the application running in stereo without inserting it into the Gear VR device. You can enable developer mode only if you have previously installed an appropriately signed VR application.

 

Enabling Gear VR developer mode

  • Go to your device Settings - Application Manager - Gear VR Service
  • Select "Manage storage"
  • Tap on the "VR Service Version" six times
  • Wait for the scan process to complete and you should now see the Developer Mode toggle

Developer mode allows you to launch the application without the headset and also dock the headset at any time without having Home launch.

Figure 3. Steps to enable VR Developer mode on Samsung Galaxy S6.

 

 


Figure 4. Side-by-side view of stereo viewport captures with VR developer mode enabled.

 

 

4. Not as simple as it seems. Considering VR specifics.

After following the instructions above, I saw nothing but a black screen when inserting the device into the headset. It took me some time to get the VR application running and to establish that some existing features had to be changed and others added.

VR is a completely different user experience and this is therefore one of the key issues when porting to VR. The original demo had an animation mode which moved the camera through different parts of the cave to show the main features and effects. However, in VR this animation caused motion sickness to the majority of users, particularly when moving backwards. We therefore decided to remove this mode completely.

We also decided to remove the original UI. In the original Ice Cave demo a tap on the screen triggers a menu with different options but this was unsuitable for VR.  The original navigation system, based on two virtual joysticks, was also unsuitable for VR so we decided to entirely replace it with a very simple user interaction based on the touch pad:

  • Pressing and holding the touch pad moves the camera in the direction the user looks.
  • When you release the pressure the camera stops moving.
  • A double tap resets the camera to the initial position.

This simple navigation system was deemed to be intuitive and easy by all users trying the VR version of the demo.


Figure 5. User interaction with touch pad  on the Samsung Gear VR.

 

The camera speed was also a feature we considered carefully as many users experienced motion sickness when the camera moved just a little too fast. After some tests we were able to set a value that most people were comfortable with.

Additionally, the camera has to be set as a child of a game object. This is the only way Unity can automatically integrate the head tracking with the camera orientation. If the camera has no parent this link will fail so any translation and rotation of the camera has to be applied to the camera parent node.

In VR, as in reality, it is important to avoid tight spaces so the user doesn’t feel claustrophobic. The original Ice Cave was built with this in mind and provides ample space for the user.

The only effect not ported to VR was the dirty lens effect. In the original Ice Cave demo this effect is implemented as a quad that is rendered on top of the scene. A dirty texture appears with more or less intensity depending on how much the camera is aligned with the sun. This didn’t translate well to VR and so the decision was made to remove it completely from the VR version.


Figure 6. Dirty lens effect implemented in the original Ice Cave demo.

 

5. Extra features in the Ice Cave VR version

In the original demo the user can pass through the walls to look at the cave from the outside. However, in VR this didn’t create a good experience and the sensation of immersion disappeared when you went out of the cave. Instead, I implemented camera collision detection and smooth sliding for when the user moves very close to the walls.

When running a VR application on Samsung Gear VR, people around the user are naturally curious about what the user is actually seeing. We thought that it would be interesting, particularly for events, to stream the content from the VR headset to another device such as a tablet. We decided to explore the possibility of streaming just the camera position and orientation to a second device running a non-VR version of the same application.

The new Unity network API allowed rapid prototyping and in a few days I had an implementation which worked pretty well. The device actually running the VR version on the Samsung Gear VR works as a server and in each frame sends the camera position and orientation over wireless TCP to a second device that works as a client.


Figure 7. Streaming camera position and orientation from Samsung Gear VR to a second device.

 

Using the built-in touch pad to control the camera motion proved very successful. Nevertheless, we decided to provide the user with an alternative method of control using an external Bluetooth mini controller readily available elsewhere. This required us to write a plugin to extend the Unity functionality by intercepting the Android Bluetooth events and using them to trigger movement and resetting of the camera. Unfortunately there is not much information available, so whilst it was only possible to intercept the messages coming from two keys, this was enough to move/stop and reset the camera.


 

 

6. Conclusions

Ice Cave VR was implemented during my summer placement with ARM’s Ecosystem Demo team in less than eight weeks with no previous experience of Unity. This was possible thanks to the native VR integration Unity released on version 5.1. In principle, just a few steps are necessary to port a game to VR, although in practice you need to do some extra work to fine-tune the specific requirements of VR in your game. With this integration, Unity has greatly contributed to the democratisation of VR.

Unity VR integration is still in progress and some reported issues are expected to be solved in coming versions. Nonetheless, the Ice Cave VR version shows that it is possible to run high quality VR content on mobile devices if resources are balanced properly at runtime by using highly optimized rendering techniques.

All the advanced graphics techniques utilised in the Ice Cave demo are explained in detail in the ARM Guide for Unity Developers. In the guide you can find the source code or code snippets for these techniques, which allowed me to understand how they work.

What I consider most significant in all this is the fact that with mobile VR we are no longer limited to the size of our smartphones to enjoy a game. Now we can be part of a limitless virtual world and enjoy a wonderful VR experience from a tiny smartphone inserted in a headset. This really is an outstanding step forward!

 

Now you can check out the video

Mali Performance 7: Accelerating 2D rendering using OpenGL ES


It's been a while since my last performance blog, but one of those lunchtime coffee discussions about a blog I'd like to write was turned into a working tech demo by wasimabbas overnight, so thanks to him for giving me the kick needed to pick up the digital quill again. This time around I look at 2D rendering, and what OpenGL ES can do to help ...

 

A significant amount of mobile content today is still 2D gaming or 2D user-interface applications, in which the applications render layers of sprites or UI elements to build up what is on screen. Nearly all of these applications are actually using OpenGL ES to perform the rendering, but few applications actually leverage the 3D functionality available, preferring the simplicity of a traditional back-to-front algorithm using blending to handle alpha transparencies.

 

[Figure: a square shield sprite blitted onto a background - the outer region is fully transparent, the border partially transparent, and the centre opaque]

 

This approach works, but doesn’t make any use of the 3D features of the hardware, and so in many cases makes the GPU work a lot harder than it needs to. The impacts of this will vary from poor performance to reduced battery life, depending on the GPU involved, and these impacts are amplified by the trend towards higher screen resolutions in mobile devices. This blog looks at some simple changes to sprite rendering engines which can make the rendering significantly faster, and more energy efficient, by leveraging some of the tools which a 3D rendering API provides.

 

Performance inefficiency of 2D content

 

In 2D games the OpenGL ES fragment shaders are usually trivial – interpolate a texture coordinate, load a texture sample, blend to the framebuffer – so there isn’t much there to optimize. Any performance optimization for this type of content is therefore mostly about finding ways to remove redundant work completely, so that the shader never even runs for some of the fragments.

 

The figure in the introduction section shows a typical blit of a square sprite onto a background layer; the outer parts of the shield sprite are transparent, the border region is partially transparent so it fades cleanly into the background without any aliasing artifacts, and the body of the sprite is opaque. These sprite blits are rendered on top of what is in the framebuffer in a back-to-front render order, with alpha blending enabled.

 

There are two main sources of inefficiency here:

 

  • Firstly, the substantial outer region around this sprite is totally transparent, and so has no impact on the output rendering at all, but takes time to process.
  • Secondly, the middle part of the sprite is totally opaque, completely obscuring many background pixels underneath it. The graphics driver cannot know ahead of time that the background will be obscured, so these background fragments have to be rendered by the GPU, wasting processing cycles and nanojoules of energy rendering something which is not usefully contributing to the final scene.

 

This is a relatively synthetic example with only a single layer of overdraw, but we see real applications where over half of all of the rendered fragments of a 1080p screen are redundant. If applications can use OpenGL ES in a different way to remove this redundancy then the GPU could render the applications faster, or use the performance headroom created to reduce the clock rate and operating voltage, and thus save a substantial amount of energy. Either of these outcomes sounds very appealing, so the question is how can application developers achieve this?

 

Test scene

 

For this blog we will be rendering a simple test scene consisting of a cover-flow style arrangement of the shield icon above, but the technique will work for any sprite set with opaque regions. Our test scene render looks like this:

 

[Figure: the test scene - a cover-flow style arrangement of shield sprites over a background]

 

… where each shield icon is actually a square sprite using alpha transparencies to hide the pieces which are not visible.

 

Tools of the trade

 

In traditional dedicated 2D rendering hardware there are not usually many options to play with; the application has to render the sprite layers from back to front to make sure blending functions correctly. In our case the applications are using a 3D API to render a 2D scene, so the question becomes what additional tools does the 3D API give the applications which can be used to remove redundant work?

 

The principal tool used in full 3D scene rendering to remove redundant work is the depth test. Every vertex in a triangle has a “Z” component in its position, which is emitted from the vertex shader. This Z value encodes how close that vertex is to the camera, and the rasterization process will interpolate the vertex values to assign a depth to each fragment which may need fragment shading.  This fragment depth value can be tested against the existing value stored in the depth buffer, and if it is not closer [1] to the camera than the current data already in the framebuffer then the GPU will discard the fragment without ever submitting it to the shader core for processing, as it now safely knows that it is not needed.

 

Using depth testing in “2D” rendering

 

Sprite rendering engines already track the layering of each sprite so that they stack correctly, so we can map this layer number to a Z coordinate value assigned to the vertices of each sprite which is sent to the GPU, and actually render our scene as if it has 3D depth. If we then use a framebuffer with a depth attachment, enable depth writes, and render the sprites and background image in front-to-back order (i.e. the reverse of the normal blitting pass, which is back-to-front), then the depth test will remove parts of sprites and the background which are hidden behind other sprites.
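A hedged sketch of the GL state and the layer-to-depth mapping for this front-to-back pass is shown below; the layer count and the per-sprite draw call are placeholders standing in for whatever a real sprite engine provides.

    /* Sketch of the front-to-back pass with depth testing and depth writes.
     * MAX_LAYERS and the per-sprite draw call are placeholders. */
    #include <GLES2/gl2.h>

    #define MAX_LAYERS 256

    /* Map a sprite's layer index to a depth value: the background (layer 0)
     * sits just inside the far plane and higher layers move closer to the
     * camera, so nothing lands exactly on the cleared depth of 1.0. */
    static GLfloat layer_to_depth(int layer)
    {
        return 1.0f - ((GLfloat)(layer + 1) / (GLfloat)(MAX_LAYERS + 1));
    }

    static void render_front_to_back(void)
    {
        glEnable(GL_DEPTH_TEST);
        glDepthFunc(GL_LESS);
        glDepthMask(GL_TRUE);    /* this pass writes depth */

        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

        /* Submit sprites front-to-back (highest layer first) so anything
         * hidden behind something already drawn is rejected by the early
         * depth test before its fragments are shaded:
         *
         * for each sprite, highest layer first:
         *     draw_sprite(sprite, layer_to_depth(sprite->layer));
         */
    }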

 

If we run this for our simple test scene, we get:

 

[Figure: the depth-tested render using square sprite geometry - regions behind the transparent parts of closer sprites are missing, showing only the clear color]

 

Uh oh! Something has gone wrong.

 

The issue here is that our square sprite geometry does not exactly match the shape of opaque pixels. The transparent parts of the sprites closer to the camera are not producing any color values due to the alpha test, but are still setting a depth value. When the sprites on a lower layer are rendered the depth testing means that the pieces which should be visible underneath the transparent parts of an earlier sprite are getting incorrectly killed, and so only the OpenGL ES clear color is showing.

 

Sprite geometry

 

To fix this issue we need to invest some time into setting up some more useful geometry for our sprites. We can only safely set the depth value when rendering front-to-back for the pixels which are totally opaque in our sprite, so the sprite atlas generation needs to provide two sets of geometry for each sprite. One set, indicated by the green area in the middle image below, covers only the opaque geometry, and the second, indicated by the green area in the right image below, picks up everything else unless it is totally transparent (in which case it can be dropped completely).

 

[Figure: sprite geometry sets - the original sprite (left), the opaque region shown in green (middle), and the transparent region shown in green (right)]

 

Vertices are relatively expensive, so use as little additional geometry as possible when generating these geometry sets. The opaque region must only contain totally opaque pixels, but the transparent region can safely contain opaque pixels and totally transparent pixels without side-effects, so use rough approximations for a "good fit" without trying to get "best fit". Note for some sprites it isn’t worth generating the opaque region at all (there may be no opaque texels, or the area involved may be small), so some sprites may consist of only a single region rendered as a transparent render. As a rule of thumb, if your opaque region is smaller than 256 pixels it probably isn't worth bothering with the additional geometry complexity, but as always it's worth trying and seeing.

 

Generating this geometry can be relatively fiddly, but sprite texture atlases are normally static so this can be done offline as part of the application content authoring process, and does not need to be done live on the platform at run time.

 

Draw algorithm

 

With the two geometry sets for each sprite we can now render the optimized version of our test scene. First render all of the opaque sprite regions and the background from front-to-back, with depth testing and depth writes enabled. This results in the output below:

 

[Figure: output of the front-to-back opaque pass - only the opaque sprite regions and the visible parts of the background are shaded]

 

Any area where one sprite or the background is hidden underneath another sprite is rendering work which has been saved, as that is an area which has been removed by the early depth test before shading has occurred.

 

Having rendered all of the opaque geometry we can now render the transparent region for each sprite in a back-to-front order. Leave depth testing turned on, so that sprites on a lower layer don't overwrite an opaque region from a sprite in a logically higher layer which has already been rendered, but disable depth buffer writes to save a little bit of power.
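The state changes for this second pass are small; the sketch below (with placeholder draw calls, continuing the earlier layer_to_depth mapping) shows the transparent pass that follows the opaque pass.

    /* Sketch of the back-to-front transparent pass: depth testing stays on so
     * lower layers can't draw over opaque regions already rendered closer to
     * the camera, but depth writes are off and normal alpha blending is used.
     * The per-sprite draw call is a placeholder. */
    #include <GLES2/gl2.h>

    static void render_transparent_pass(void)
    {
        glEnable(GL_DEPTH_TEST);  /* still reject fragments behind opaque regions */
        glDepthMask(GL_FALSE);    /* ...but don't write depth in this pass        */
        glEnable(GL_BLEND);
        glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);

        /* Submit the transparent regions back-to-front (lowest layer first)
         * so that blending composites correctly:
         *
         * for each sprite, lowest layer first:
         *     draw_transparent_region(sprite, layer_to_depth(sprite->layer));
         */
    }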

 

If we clear the color output of the opaque stage, but keep its depth values, and then draw the transparent pass, we can visualize the additional rendering added by this pass. This is shown in the figure below:

 

[Figure: the transparent pass in isolation - partial outer rings show where the depth test has removed work]

 

Any area where one of the outer rings is partial indicates an area where work has been saved, as the missing part has been removed by the depth test using the depth value of an opaque sprite region closer to the camera which we rendered in the first drawing pass.

 

If we put it all together and render both passes to the same image then we arrive back at the same visual output as the original back-to-front render:

 

[Figure: the combined result of both passes - identical to the original back-to-front render]

 

... but achieve that with around 35% fewer fragment threads started, which should translate approximately to a 35% drop in MHz required to render this scene. Success!

 

The final bit of operational logic needed is to ensure that the depth buffer we have added to the scene is not written back to memory. If your application is rendering directly to the EGL window surface then there is nothing to do here, as depth is implicitly discarded for window surfaces automatically, but if your engine is rendering to an off-screen FBO ensure that you add a call to glInvalidateFramebuffer()  (OpenGL ES 3.0 or newer) or glDiscardFramebufferEXT() (OpenGL ES 2.0) before changing the FBO binding away from the offscreen target. See Mali Performance 2: How to Correctly Handle Framebuffers for more details.
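For an off-screen FBO the call sequence might look like the sketch below (OpenGL ES 3.0 entry point shown; the wrapper function name is illustrative).

    /* Sketch: discard the transient depth attachment of an off-screen FBO
     * before unbinding it, so its contents are never written back to memory.
     * On OpenGL ES 2.0, call glDiscardFramebufferEXT() from
     * GL_EXT_discard_framebuffer with the same arguments instead. */
    #include <GLES3/gl3.h>

    static void discard_offscreen_depth(GLuint offscreen_fbo)
    {
        const GLenum attachments[] = { GL_DEPTH_ATTACHMENT };

        glBindFramebuffer(GL_FRAMEBUFFER, offscreen_fbo);
        glInvalidateFramebuffer(GL_FRAMEBUFFER, 1, attachments);

        /* Only now move the FBO binding away from the off-screen target. */
        glBindFramebuffer(GL_FRAMEBUFFER, 0);
    }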

 

Summary

 

In this blog we have looked at how the use of depth testing and depth-aware sprite techniques can be used to accelerate rendering using 3D graphics hardware significantly.

 

Adding additional geometry to provide the partition between opaque and transparent regions of each sprite does add complexity, so care must be taken to minimize the number of vertices for each sprite, otherwise the costs of additional vertex processing and small triangle sizes will outweigh the benefits. For cases where the additional geometry required is too complicated, or the screen region covered is too small, simply omit the opaque geometry and render the whole sprite as transparent.

 

It’s worth noting that this technique can also be used when rendering the 2D UI elements in 3D games. Render the opaque parts of the UI with a depth very close to the near clip plane before rendering the 3D scene, then render the 3D scene as normal (any parts behind the opaque UI elements will be skipped), and finally the remaining transparent parts of the UI can be rendered and blended on top of the 3D output. To ensure that the 3D geometry does not intersect the UI elements, glDepthRange() can be used to slightly limit the range of depth values emitted by the 3D pass, guaranteeing that the UI elements are always closer to the near clip plane than the 3D rendering.
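A minimal sketch of that frame structure, with an arbitrarily chosen depth range split and placeholder draw helpers, might look like this:

    /* Sketch of reserving a thin slice of the depth range for the UI so the
     * 3D scene can never intersect it. The 0.01 split and the draw helpers
     * are illustrative placeholders. */
    #include <GLES2/gl2.h>

    static void render_frame_with_ui(void)
    {
        /* 1. Opaque UI first, pinned very close to the near clip plane. */
        glDepthRangef(0.0f, 0.01f);
        /* draw_opaque_ui(); */

        /* 2. The 3D scene, restricted to the rest of the range so it always
         *    resolves as further away than the UI elements. */
        glDepthRangef(0.01f, 1.0f);
        /* draw_3d_scene(); */

        /* 3. Finally blend the transparent UI parts over the 3D output. */
        glDepthRangef(0.0f, 0.01f);
        /* draw_transparent_ui(); */
    }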

 

Tune in next time,

Pete

 


[1] Other depth test functions are possible in OpenGL ES, but this is the common usage which is analogous to the natural real world behaviour.

 


 

Pete Harris is the lead engineer for the Mali GPU performance analysis team at ARM. He enjoys spending his time at a whiteboard determining how to get the best out of complex graphics sub-systems, and how to make ARM Mali GPUs even better.

Is Virtual Reality the Next Big Thing…Again?


In this blog series we’ll be looking at the current status of Virtual Reality with a special focus on mobile VR, where it’s heading in the future and what ARM® can do to help us get there. We’ll be covering the common challenges and pitfalls and how these can be avoided to produce a great VR experience. We’ll also keep you up to date with other blogs and discussions around technical tips, tutorials and FAQ’s; so stay tuned for the future of mobile VR.

 

Where VR is now

Virtual Reality isn’t new, people have been talking about it since the 90s, so why has the industry never quite taken off in the way we might expect? The quick answer is that the technology simply wasn’t there. The hardware was prohibitively expensive and very bulky and the graphics capabilities were too limited to produce a successful VR user experience - unless you consider motion sickness a success. Now however, lower cost hardware based on existing platforms is changing the game, with mobile platforms offering console-like performance. Not only that, but existing mobile devices already contain many of the sensors VR requires, from gyros to accelerometers, opening up a whole world of mobile VR possibilities.

 

What’s next for VR

The Virtual Reality industry is forecast to be worth US$30 billion by 2020, and all of that has to come from somewhere.

Fig.1 Digi-Capital VR Revenue Forecast

 

Gaming is of course a huge industry and a high-end, immersive gamer experience can now be literally at your fingertips. Mobile VR allows you to become fully involved in your chosen game at home, work, or while trying to escape the monotony of public transport; but that’s not all VR can do. Researching a university assignment can be a chore, but how about if you could visit the most relevant museums or seminars without having to leave the dorm? VR allows us to see exhibitions in world class museums and galleries without an expensive trip to London, Paris, or anywhere else. Shopping too, isn’t everyone’s favourite pastime, especially around the Christmas rush. Wouldn’t it be great if you could wander the aisles and compare options for your next car, sofa, TV or even pair of shoes, without tripping over pushchairs or being upsold by pushy assistants? All this is possible with the huge technical advances in VR and it’s only a matter of time until this is our standard way of working.

 


Fig.2 nDreams® Perfect Beach experience allows you to get away from it all without leaving the sofa

 

So how does VR actually work?

Technology is the key to VR success and this blog series will talk about exactly what you need to make it happen. VR comes in mobile or desktop options, but according to Oculus® Co-founder Palmer Luckey, desktop VR is seriously compromised by the requirement for a ‘cable servant’ to follow the user around preventing trip hazards. So mobile VR is the quickest way forward, and the simplest of the mobile options allows you to simply slot your smartphone into the headset and get started. The headset provides you with a stereoscopic display, with two marginally different images rendered for the left and right eye, allowing the user to experience depth. Barrel distortion is then applied to the rendered images in post processing to counteract the curvature of the lenses.

 


Fig.3 Marginally different images for each eye allow the perception of depth and barrel distortion applies curvature to the image to counteract the curvature of the lens
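As a rough illustration of the kind of correction involved, barrel distortion is often modelled as a radial polynomial applied to each lens-centred coordinate; the coefficients below are purely illustrative and real headsets calibrate them to their specific lenses.

    /* Hedged sketch of a simple radial barrel distortion model. k1 and k2 are
     * illustrative coefficients only; uv is a coordinate relative to the lens
     * centre, roughly in the range [-1, 1]. */
    typedef struct { float x, y; } Vec2;

    static Vec2 barrel_distort(Vec2 uv, float k1, float k2)
    {
        float r2 = uv.x * uv.x + uv.y * uv.y;        /* squared radius        */
        float scale = 1.0f + k1 * r2 + k2 * r2 * r2; /* radial scaling factor */

        Vec2 out = { uv.x * scale, uv.y * scale };
        return out;
    }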

 

Finally, sensors in the device detect the movement of your head and adjust the scene in real time to render the updated view to the headset and allow realistic visual feedback. Going forward, additional sensors will facilitate live hand-tracking for a truly immersive experience, and this can be combined with the use of an inbuilt or add-on controller to allow you to interact fully with your virtual surroundings.

 

VR Optimisation with Mali GPUs

As with any emerging technology, there are issues that can stand in the way of a truly successful VR user experience. These include low resolution blurring the image and compromising visual quality, or a low frame rate making the display appear stilted or jerky. A major issue experienced when developing for VR is latency, or the time it takes for the on-screen image to catch up with the user’s head movement, and this is one of the key causes of sickness or dizziness in VR users.

The ARM® Mali™ GPU family is the world’s #1 licensable GPU in terms of shipments and is perfectly positioned to deliver an optimum VR experience. The Mali GPU architecture enables high resolutions and power savings through features such as Adaptive Scalable Texture Compression (ASTC), while ARM Frame Buffer Compression (AFBC) dramatically reduces system bandwidth, and performance scales fully across multiple cores. Mali support for extensions to OpenGL ES and EGL reduces latency and improves overall performance.

 

What we’re doing now

At events like VRTGO, ARM recently demonstrated how great a mobile VR experience can be with the Mali-based Samsung® Gear VR headset, a collaboration between Samsung Mobile and Oculus. The first version was based on the Galaxy Note 4, with the second generation now available for the Galaxy S6, both powered by the Mali-T760. The Ice Cave demo, featuring Geomerics Enlighten global illumination in collaboration with RealtimeUK, was easily ported to VR on the Samsung Gear VR headset; read about how we did this here.


Superior visuals and a smooth user experience are all possible in mobile VR and throughout this blog series we’ll be discussing the common challenges surrounding developing for VR and how ARM’s technology and Mali GPUs can help you overcome them.

 

Stay tuned for more on mobile VR!

Virtual Reality: A Focus on Focus


As discussed in our virtual reality (VR) intro blog, whilst the VR industry is moving from strength to strength, there are still a few key issues which often stand in the way of developing a truly successful VR experience. For the purposes of this blog series we’ll be focussing on what we consider to be some of the main blockers. We’ll discuss the best ways to overcome these with technology and techniques that are currently available, as well as considering what we foresee in the future of the field.

So, to focus on focus. Our eyes are complicated objects and work very precisely to ensure the images we see are processed clearly and accurately (unless you’re a secret spectacle-wearer like me, in which case they’re letting the side down to be honest). The eye works in a similar way to a magnifying glass by concentrating light through the cornea and lens and directing it to the retina. When an image is in focus, all paths of light from a single pixel travel to a single point on the retina, allowing us to see it clearly. When an image is out of focus however, that same pixel travels to different parts of the eye, causing the image to appear blurry.


Light paths reaching the eye in focus


Light paths reaching the eye out of focus

 

This is a particular issue in current mobile VR systems as the images you see are all the same distance from your eye, namely just a few centimetres away on the screen of your smartphone. The images remain at this focal depth even if the differences between what is shown to each eye tell us that the image is at a different depth. This conflict between the depth from the stereo image (vergence) and the apparent focal depth (accommodation) makes it difficult for our brains to adjust. This can lead to visual discomfort, headaches and nausea.

 

The impact on VR

The simplest type of video content produced for VR is monoscopic, using what amounts to a single ‘360°’ camera. In practice this single image will likely use multiple cameras to cover the entire field of view and stitch the separate images together retrospectively. This is the cheapest and easiest method of 360° video production but doesn’t provide any of that vital depth information. A more complicated, but arguably better, approach is to use stereoscopic video which is produced by capturing two 360° images, one for each eye, providing depth perception from the stereo image. This method is more difficult to get right due to the complications of stitching the images for each eye together accurately. Capturing enough information to reproduce the focal depth as well as a stereo image is more complicated still, but advances are being made every day.

Arguably, the most successful way of addressing this issue of focus currently is with the effective use of light field displays. A ‘light field’ is how we describe all the light travelling through a region of space. A simple way to think about a light field display is to think about a regular display that could show you a different picture depending on the direction from which you viewed it. Although it may not be obvious, light field displays allow us to show images that have different focal depths. Objects further away from the viewer would create a noticeably different light field on the display than the same objects closer up. This requires the eye to adjust its focus to see them clearly, removing the brain-confusing mismatch between the focal and stereo depths.

 

Micro-lens Arrays

One way of creating a light field display to improve focus in VR is with the use of a micro-lens array, which is an overlaid transparent sheet with huge numbers of tiny, raised lenses.


Micro-lens array

 

Each micro-lens covers a small number of pixels of a regular display. These are beginning to emerge as technologies for wearables such as smartwatches, and the image you see changes depending on which way you view it, a bit like an advanced version of the lenticular images you get in breakfast cereal boxes. However, this micro-lens method forces a trade-off against resolution as it effectively turns multiple pixels into one.

Fig4 Micro-lens display.jpg

Micro-lens display

 

Micro-lens arrays are also reported to be complicated to produce at present, so we’ll also consider an alternative option.

The Light Field Stereoscope

To take advantage of the depth benefits of stereoscopy, multi-layer displays are currently being researched, where multiple display panels are layered with small gaps separating them. The eye sees each panel at a different focal distance and so by carefully crafting the content displayed on each layer, the display can place images at an appropriate depth. At SIGGRAPH 2015, Stanford University’s Huang et al presented the ‘Light field Stereoscope’ in which two LCD panels are backlit and placed one behind the other in the VR headset, with a spacer between. This allows us to project the background, or distant, images on the rear screen while the front screen displays images in the foreground of the scene you’re viewing. Distances in the middle of this range can be depicted by displaying partial images on each. This approximate light field adds some focal depth to the display, with objects in the foreground occluding those further back. The interaction of the two 2D displays is not the same as a true 4D light field but may well be sufficient. However, while there isn’t the same resolution trade off that we saw with the micro-lens approach, the front LCD panel acts as a diffuser and can therefore introduce some blurring.

Fig5 Light Field Display.jpg

Huang et al’s Light Field Stereoscope

 

What comes next?

To accompany light field displays, in recent years we have seen the emergence of light field cameras such as Lytro, which capture the full 4D light field across a camera aperture. These computational cameras let you produce light field images that can be refocused, or have their viewpoint shifted, after capture, opening up all kinds of creative possibilities. These cameras are also ostensibly better than your average camera as they typically capture more light. Next-generation 360° light field cameras promise to take this further and extend the focus and viewpoint-independent properties of light field images to VR. This bodes well for the future of 360° light field video playback on mobile VR devices. It will allow the user to explore the whole scene with freedom of head movement and natural depth perception, all within a VR headset. Another emerging area in VR is positional tracking on mobile, which will allow real-time response of the images to the actual physical location of the user’s head, a vital point for achieving a truly immersive experience and something we’ll be considering in more depth in the future.

 

Follow me on the ARM Connected Community and Twitter to make sure you don’t miss out on the next post in our VR series!


Mali-T880 is set to Deliver the Premium Mobile Experience of 2016

$
0
0

ARM®’s newest high performance GPU, the Mali™-T880, delivers better than ever graphics performance whilst remaining well within the energy constraints of a mobile platform. The exciting possibilities this opens up are beginning to be realised with the recent announcements of the Mali-T880 appearing in key devices by two of ARM’s partners.

 

As graphics content for mobile devices becomes increasingly complex, it places greater and greater pressure on the computational capabilities of GPUs. For a start, consumer expectations are that mobile gaming will provide the same high quality user experience as that available on console. The recent surge in virtual reality content, which is expected to make the biggest impact in mobile rather than desktop, also means that high capability GPUs are essential for all levels of mobile devices. Even mid-range smartphones these days are required to perform increasingly challenging computational procedures at low cost, so the next generation of high end mobile devices to hit the market are facing ever higher requirements for superior graphics capability. One of the key requirements is to be able to deliver high-end mobile gaming and applications, whilst still constrained by the power (and therefore thermal) limits of mobile devices.

 

The ARM Mali family of GPUs is perfectly positioned to address the needs of the whole mobile industry from entry level to top of the line, next-generation products. Based on the Mali Midgard architecture, the Mali-T880 is scalable from one to sixteen cores. It features an additional arithmetic pipeline per shader core in order to handle greater computational throughput than its predecessors and is up to 40% more energy efficient. Not only that, but the adoption of bandwidth saving technologies such as ARM Frame Buffer Compression (AFBC), Smart composition and Transaction elimination all allow for the Mali-T880 to be extremely efficient on memory bandwidth.

T880_Detail_Diagram1.jpg

ARM Mali-T880 GPU Structure

 

Two of ARM’s partners have recently released their take on the Mali-T880: Samsung and Huawei have both taken advantage of the exceptional performance and leading energy efficiency of the Mali-T880 to fulfil these very current needs. Huawei’s Mate 8 smartphone, announced last week, is based on the HiSilicon Kirin 950 chipset and contains an optimal combination of ARM Cortex-A72 and Cortex-A53 CPUs and the Mali-T880 GPU to provide a harmonized suite of efficient computing power. Samsung’s Exynos 8 Octa uses the Mali-T880 to ensure the highest quality user interface, immersive gaming and virtual reality content.

Mate8s.png

Huawei’s New Mate 8 Smartphone

 

2016 is set to feature yet another step forward in high end devices and with the launch of the Samsung Gear VR, consumer expectations for mobile graphics continue to grow. As more and more of ARM’s partners begin to utilise the Mali-T880, keep an eye out for the next generation of devices hitting the market in the new year.

 

Stay tuned for more news on Mali and Premium mobile, there are heaps of exciting developments to come.

Shedding Light on Global Illumination Techniques

$
0
0

This blog is based on a recent article written by Geomerics for the publication Making Games. For more information on any of the techniques outlined in this blog, read more here.

 

Lighting is the most fundamental element of any game environment. Applied correctly, it brings three dimensional geometry to life, amplifies the emotions of a scene and draws the eyes to important areas of focus. Global illumination, the interplay of lights and materials, plays a significant role in creating this balanced scene lighting.

   Direct vs Combined.png

Image 1: The left image shows direct lighting only; the right image shows the difference the addition of global illumination makes.

 

It is vital that games industry members understand the variety of techniques available and adopt the one that is most suitable for their platform, environment, gameplay and performance budget. In this article we’ll explore a variety of options available, weigh up their advantages and limitations and conclude with recommendations for different scenarios.

 

1. Baked Lighting

Baked lighting is an extremely common technique used for both direct and indirect lighting information as well as other effects such as ambient occlusion. It fully pre-calculates lighting information for static scenes and stores the output in static data structures which are then consumed by fragment shaders. The most commonly used data structures are textures, spherical harmonics probes and cubemaps.

 

Baking can produce high quality results and is among the most affordable runtime techniques, enabling performance efficient indirect lighting even on mobile devices.  However, everything is static at runtime both in the game and in the editor. Light sources, materials and meshes can’t be changed at runtime. The only option for dynamic content is to swap out entire data sets, which is very costly with respect to memory consumption as well as creating and managing this additional content. This either limits dynamic gameplay options or requires sacrifices with respect to visual quality or dynamic content.

 

In addition, baked lighting has the lengthiest workflow of all the global illumination solutions listed in this article. In order to see the effect of any changes made to a scene, no matter how small, artists need to run a full lighting bake. Given that in order to get a scene to AAA quality an artist will need to iterate on each individual light many times, baking can have a significant impact on a studio’s bottom line as well as the quality of the final result.

 

2. Light Propagation Volumes

The technique for light propagation volumes was proposed by Kaplanyan and Dachsbacher in 2010 as an alternative to traditional baked lighting and first implemented in CryEngine 3. The core idea behind this technique is to inject virtual point lights into a three-dimensional data structure, aggregate the injected lights for each cell, and finally iteratively propagate them to neighbouring cells.

 

Light propagation volumes allow lights, materials and objects to be fully dynamic at runtime. They work well with complex geometry and are relatively stable and flicker-free. At basic quality levels the technique can even scale down to mobile platforms. In addition, light propagation volumes can theoretically support multiple bounces, glossy reflections and participating media; however, in reality these extensions are considered too expensive for practical applications and the computational resources are rather used to improve the base lighting quality.

Isotropic participating media and glossy reflections.jpg

Image 2: Light propagation volumes can support participating media and glossy reflections

 

The technique has a number of issues that limit its application, the biggest being their low spatial resolution due to memory constraints even when employing cascades. Due to this, light and shadow leaking are a common occurrence and the lighting result is fairly ambient with low contrast. Combined with the fact that runtime performance scales with the number of light sources, light propagation volumes are mostly used for outdoor scenes with only a single directional light source.

 

3. Voxel Cone Tracing

While light propagation volumes calculate lighting for each cell and apply the indirect lighting result by a simple lookup into a volume texture, voxel cone tracing only calculates lighting in areas where there is actual geometry while performing a much more involved aggregation of the indirect lighting in the fragment shader.

LPV vs VCT.jpg

Image 3: Comparison of light propagation volumes and voxel cone tracing

 

Voxel cone tracing provides an elegant light propagation and aggregation model for both indirect lighting as well as specular reflections. It is able to handle complex geometry including transparency and participating media while delivering smooth lighting results at a higher quality than light propagation volumes. However, it is very expensive to achieve these quality levels and the technique cannot scale to more budgeted platforms. Voxel cone tracing still suffers from significant light leaking issues even at medium quality levels and the directional information required to significantly improve quality would further increase memory cost.

 

4. Pre-computed Visibility

Determining the visibility between surfaces is the most expensive part of calculating global illumination. By pre-computing the surface-to-surface visibility information, compressing this data and only at runtime combining it with material and direct lighting information, it is possible to provide high quality dynamic global illumination effects across all gaming platforms – even on mobile.

  Precompute.png

Image 4: This graph visualizes how pre-computed visibility global illumination works

 

Pre-computed visibility global illumination is similar to baked lighting in respect to storing its data output in the same three data structures: lightmaps, spherical harmonics probes and cubemaps. The difference is that precomputed visibility makes dynamic global illumination effects such as time of day, changing materials and player controlled lighting efficient enough to occur at runtime no matter the platform. Additionally, once the initial pre-computation is done, all light sources and materials can be changed in real-time in the editor and the resultant effect on the global illumination can be seen instantly. This allows artists to iterate over the lighting of a scene much more efficiently than with baked lighting, resulting in huge time-savings and often better visual quality.

 

The limitations to this technique include the requirement for enough geometry in the scene to be static to enable a precompute – certain environments, for example procedurally generated, completely dynamic or user-generated content, are not suitable. And while it is still much faster than traditional baked lighting, the pre-computation step may well be an adjustment for artists who are used to other techniques.

 

ARM’s Enlighten is the industry standard for pre-computed visibility global illumination. It is easily accessible as the default global illumination solution inside Unity 5 and licensable for Unreal Engine 3 and 4 as well as for internal engines.

 

Conclusion

The aforementioned techniques for computing global illumination can broadly be sorted into two categories:

 

Surface Techniques: Baked lighting, Pre-computed visibility

Volume Techniques: Light propagation volumes, Voxel cone tracing

 

Surface techniques compute indirect lighting based on the transfer of light between two-dimensional surfaces, whereas volume techniques rely on an underlying three-dimensional data structure. Surface techniques require different degrees of pre-computation and offer limited support for dynamic objects. However, they are affordable at runtime, don’t suffer from light or shadow leaking and are the only techniques that have proven scalability down to mobile platforms.

 

So if you are developing a game right now, which global illumination technique is best for you? The answer to this question depends strongly on the game environment and target platforms you are considering.

 

Provided that your scenes are static, baked lighting is a great option for affordable lighting that scales to mobile.

 

If your game consists largely of outdoor areas and can be lit by a single directional light source, light propagation volumes is a technique that you might want to consider.

 

For those titles which are intended only for high-end PCs, voxel cone tracing is an option. Yet, in many cases it is too expensive, especially with respect to memory consumption.

 

In all other cases, pre-computed visibility is a scalable, flexible and proven technique. If your game consists of enough static meshes to support the necessary surface-to-surface pre-computation, it offers the benefits of dynamism and rapid artist workflow along with the scalability enabled by baked lighting.

 

Enlighten.png

Mali Christmas, Popular content of 2015

$
0
0

Hello everyone,

 

We're all looking forward to Christmas so I wanted to keep this one short. For those who might be checking in over the holidays, here's some of the most popular content from the ARM Mali Graphics group this year, enjoy!

We hope you've found all our content this year to be useful and there'll be plenty more next year.

 

Missed the Graphics week back in April? Take some time to read the great blogs that came out of this week long Community event:

ARM Mali Graphics Week - April 13 -17, 2015

 

Virtual Reality was a hot topic this year, reflect with this series from freddijeffries:

Is Virtual Reality the Next Big Thing…Again?

Virtual Reality: A Focus on Focus

Virtual Reality: The Ice Cave

and learn about the Mali GPUs set to drive this innovation forward in 2016

Mali-T880 is set to Deliver the Premium Mobile Experience of 2016

 

Why not learn more about the ARM Mali demos and some of the techniques from these popular blogs:

Inside the Demo: GPU Particle Systems with ASTC 3D textures

Dynamic Soft Shadows Based on Local Cubemap

also worth a look are the Reflections Based on Local Cubemaps in Unity

From Hours to Milliseconds: Project Ice Cave

Achieving the icy wall effect in the Ice Cave demo

Learn to achieve great lighting and Global Illumination in your games with Enlighten by Geomerics

Shedding Light on Global Illumination Techniques

Forging the Future of Video Games with Enlighten 3

Lighting in Games

 

Looking for some projects over Christmas? Perhaps you've got a new Chromebook (User Space drivers available here), so why not get Linux set up with the guide below:

Linux on Chromebook with ARM® Mali™ GPU

 

Take a look at the ARM tools and getting them set-up with your device:

lorenzodalcol dived into the Mali Graphics debugger with a two part case study:

Mali GPU Tools: A Case Study, Part 1 — Profiling Epic Citadel

Mali GPU Tools: A Case Study, Part 2 — Frame Analysis with Mali Graphics Debugger

 

Don't forget about DS-5 Streamline. This tool will help you find where your application's bottlenecks are; get it set up on your ARM Mali device thanks to mcgeagh:

Using DS-5 Streamline with Mali on Samsung Galaxy Note 3 & 10.1

Using DS-5 Streamline with Mali on Samsung Galaxy Note 4

Using DS-5 Streamline with Mali on Samsung Galaxy SIII

Using DS-5 Streamline with Mali on Google Nexus 10

 

Working on performance on Mali in the new year? Learn from the expert with these series of blogs:

peterharris' series of Mali Performance blogs are a great read:

Mali Performance 1: Checking the Pipeline

Mali Performance 2: How to Correctly Handle Framebuffers

Mali Performance 3: Is EGL_BUFFER_PRESERVED a good thing?

Mali Performance 4: Principles of High Performance Rendering

Mali Performance 5: An Application's Performance Responsibilities

Mali Performance 6: Efficiently Updating Dynamic Resources

Mali Performance 7: Accelerating 2D rendering using OpenGL ES

 

Want to know more about the Mali GPU? Check out the below series also from peterharris:

The Mali GPU: An Abstract Machine, Part 1 - Frame Pipelining

The Mali GPU: An Abstract Machine, Part 2 - Tile-based Rendering

The Mali GPU: An Abstract Machine, Part 3 - The Shader Core

 

 

Merry Christmas,

Ryan

ARM enables power efficient, enhanced visual experiences on 2.5K and 4K mobile devices with the Mali-DP650 Display Processor

$
0
0

Yesterday ARM introduced its highest performing and most power efficient display processor technology, the ARM® Mali™-DP650, which targets next generation premium and mass market devices. 

 

The Mali-DP650 delivers performance and power efficiency gains of up to 2X compared to its predecessor, the Mali-DP550, in order to address the shift in mobile panel resolution from Full HD (1920x1080 pixels) to 2.5K formats including WQXGA (2560x1600 pixels) for tablets and WQHD (2560x1440 pixels) for a mobile screen. The Mali-DP550 remains the processor of choice for HD (1280x720 pixels) and Full HD devices.

 

Mali-DP650 offers enhanced 4K (3840x2160 pixels) capabilities and the ability to send 4K to a larger screen through Wi-Fi streaming (e.g. Miracast) or via a connector (such as HDMI or USB-C). It enables more display processor use cases at 4K, further reducing the requirement to engage other hardware blocks such as the GPU, thereby saving power and prolonging battery life.

 

Like its predecessor, the Mali-DP550, the Mali-DP650 offers an array of advanced features such as 7-layer composition, rotation, high quality scaling and a variety of pre and post-processing functions for video and graphics.

 


ARM Mali dp650 diagram

 

Mali-DP650: Performing functions such as composition, rotation, advanced scaling, colour conversion and tone mapping in a single pass through external memory 

 

 

 

Addressing the performance challenges as mobile panels move towards 4K

 

Previously, it was thought that Full HD video and graphics content offered sufficient detail for a mobile screen. However, there is no doubt that being able to display video, UI and gaming content at even higher resolutions makes the overall output sharper and clearer. This has already led to a visible market presence of mobile screens capable of delivering higher resolutions - WQHD, WQXGA and even 4K is now starting to appear. This, however, poses a trade-off as handling higher resolution display layers natively means a costlier SoC in terms of both size and power consumption. Yet demand for such SoCs is growing and so therefore is the demand for effective solutions to address this issue with minimum impact on area, bandwidth and power.

 

Mali-DP650 doubles the available AXI bus width to 128 bits and adds significant memory subsystem optimizations, which results in doubling the pixel throughput at the specified AXI clock rate. This exceeds performance expectations by being able to composite, process and display a combination of up to three full-frame 4K graphics and 10-bit video layers without any external intervention. This means that heavy composition use cases such as 4K video playback or gaming with UI overlays can be offloaded from the GPU onto the Mali-DP650 display processor, delivering power savings and allowing the GPU to focus on tasks it’s optimized for.

 

 

Blog2 (2).jpg

 

Mali-DP650 in a typical Mass Market Mobile system

 

 

 

Mali-DP650 also provides an MMU pre-fetcher solution to hide the delays caused by page table walks and enhances the overall interoperability with ARM’s CoreLink MMU-500 thus improving its overall tolerance to system latency. This translates to being able to build a lower cost interface to the display processor whilst still being able to deliver the same superior quality required by Full HD and beyond state-of-the-art mobile panels.

 

ARM Frame Buffer Compression (AFBC) is a system bandwidth saving technology which is known to deliver up to 50% system bandwidth saving if implemented throughout the multimedia system (GPU, Video and Display). Mali-DP650 implements AFBC and further optimizes AFBC efficiency by ensuring superior performance with DDR controllers.

 

Mali-DP650 display output is compatible with MIPI (Mobile Industry Processor Interface) DPHY transmitters.  The Mali-DP650 offers split-display mode which is necessary to be able to talk to MIPI DPHY mobile panels with screen resolutions beyond Full HD. Split display mode allows for an image to be split into two equal parts at the display output thereby allowing two separate MIPI DSI transmitters to be driven in parallel. This doubles the available data rate making display resolutions up to 4k a reality on mobile or tablet devices.  

 

Variable refresh rate saves power on the mobile panel

 

A display processor sends out pixel information to be displayed at a refresh rate that matches the rate at which the panel refreshes. In most systems today this is 60 times a second (60Hz/60fps). If the display processor did not send the information at the required rate then the panel would not be able to synchronize with the system and the visual output would not be displayed (or would be displayed with clear visual artefacts). Always sending output frames from the system to the panel at a fixed, and high, refresh rate has a key disadvantage. Even when the system is producing new visual information at a much slower rate (e.g. during video playback or when displaying a static webpage), the display processor is still running at the high refresh rate in order to keep the panel fed with information. In mobile, battery-powered systems this is detrimental to the battery life as the majority of the display subsystem must remain running continuously. To counter that, Mali-DP650 supports variable refresh - a technique whereby the output signal from the display processor is not transmitted at a constant and fixed rate but can vary on a per frame basis. The lowest frame rate will depend on the panel itself but a typical target would be 30Hz, which contributes to a longer battery life.

 

Enabling more composition use cases at resolutions beyond Full HD

 

Mali-DP650 can receive the 10-bit per pixel 4K video layer in either AFBC or YUV formats alongside additional UI layers (RGB888a or AFBC formats) and composite them with alpha blending. It can also rotate, downscale or upscale to the required resolution and even tone map the final output before sending it to the display. All of this functionality is performed in a single pass through the display processor without having to write back to external memory, thereby significantly reducing the overall system bandwidth and consequently, the power consumption. In addition, Mali-DP650 can be configured to act as a content passport for users as they move from a smaller mobile screen to a larger screen - 4K streaming video playback to external 4K displays via either HDMI, USB-C or through Wi-Fi transmission.

 

 

Blog3.jpg

 

Another common video use case on a mobile device is a mobile video call via IP. This requires simultaneous video streams, one from the local camera and one from the other device, to be composited as a picture-in-picture (PIP) on the mobile display. The Mali-DP650 display processor enables PIP on the mobile display by allowing composition for the two video layers along with an additional graphics UI overlay, and can also upscale or downscale to meet the requirements of the local mobile panel. This again provides system-wide power savings by enabling composition, rotation and scaling without having to engage the GPU.

 

 

Blog4.jpg

 

 

And it’s not just about the hardware…

 

Mali-DP650 is delivered with an Android HW Composer driver and an associated ADF-based Kernel Driver optimized to work alongside Mali GPU and Mali Video drivers. The ADF kernel device drivers are licensed under GPLv2 and are available in the driver section of the Mali Developer Centre. For more information on the Mali Display Android drivers, please check out the following blog: Do Androids have nightmares of botched system integrations?

 

In summary…

 

The Mali-DP650 display processor is 4K capable and optimised for 2.5K. It offers new features and performance enhancements that enable beyond Full HD mobile and tablet displays with optimal power usage. Mali-DP650 aims to drive new visual content and gaming experiences on mobile devices and act as a content passport for users as they move from a smaller mobile screen to a larger screen, such as a 4K smart television. Mali-DP650 further expands the Mali family of display processors, including Mali-DP550, which remains the processor of choice for mobile panels below Full HD. Display processors are critical for enabling the best overall user experience and prolonging the battery life of the device and we expect the Mali-DP650 to begin appearing in devices from early 2017.

Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 1

$
0
0

Read Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 2

Read Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 3

 

Introduction

 

This is the first article of three that will focus on the implementation of the Fast Fourier Transform (FFT) using the mixed-radix method on Mobile ARM® Mali™ GPU by means of OpenCL™.

This blog series continues the work of neiltan who analyzed the main strategies for optimizing the radix-2 FFT algorithm in his blog Optimizing Fast Fourier Transformation on ARM Mali GPUs.

In this first post, we are going to build up a minimal background for the 1D complex-to-complex FFT algorithm, starting by pointing out the limits of the direct DFT computation and introducing the well-known and commonly used radix-2 FFT.

The mathematical perspective of the mixed-radix FFT will be used to explain the key concepts of digit-reverse, twiddle factor multiplication and radix computation. These will be covered extensively in the second article from the point of view of an implementation in OpenCL.

 

Before starting, we'd like to thank Georgios Pinitas and Moritz Pflanzer for their important help in reviewing this article and for their contribution to this implementation.

 

 

Discrete Fourier Transform

 

The Discrete Fourier Transform (DFT) is a domain transform (time -> frequency, image -> frequency,..) widely used in real-time digital signal processing for a variety of tasks such as spectrum analysis and convolution in the frequency domain.

It is defined as follows:
dft.png

where:

  • x(n) where n ∈ [0, N-1] is the n-th element of the input complex data sequence uniformly sampled
  • X(k) where k ∈ [0, N-1] is the k-th element of the output complex DFT

 

The inverse function of the DFT, known as the Inverse Discrete Fourier Transform (IDFT), allows us to obtain the original function.

idft.png

Commonly the above expressions are written in the following tightly-packaged forms:

dft_twiddle.png

idft_twiddle.png

with

twiddle_factor.png

twiddle_conj.png
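For reference, since the expressions above are embedded as images, the standard forms they depict can be written out as:

X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j\frac{2\pi}{N}kn}, \qquad x(n) = \frac{1}{N}\sum_{k=0}^{N-1} X(k)\, e^{+j\frac{2\pi}{N}kn}

and, with the twiddle factor W_N = e^{-j\frac{2\pi}{N}}:

X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{kn}, \qquad x(n) = \frac{1}{N}\sum_{k=0}^{N-1} X(k)\, W_N^{-kn}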

convolution.png

   Convolution in the frequency domain corresponds to a simple pointwise multiplication

 

 

Analyzing the complexity of the direct computation of DFT

 

Generally the complexity of an algorithm can be defined in terms of the number of multiplications. We can easily infer the complexity of the direct computation by decomposing the N-point DFT into its real and imaginary parts:

decomposition_dft.png

where in the above relation we have applied the following trigonometric relation:

trigonometric.png

Equating the real and imaginary part:

real_part_dft.png

imag_part_dft.png

the direct computation would require:

  • 2 trigonometric multiplications for N iterations for the real part = 2*N
  • 2 trigonometric multiplications for N iterations for the imaginary part = 2*N

 

Since we have to compute K output complex values, we have:

  • 2 * N * K for the real part
  • 2 * N * K for the imaginary part

 

therefore the total multiplications are: 2*N^2 + 2*N^2 = 4 * N^2

 

giving the direct computation a complexity of O(N^2).
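To make this counting concrete, below is a minimal, unoptimized C sketch of the direct computation (for illustration only; it is not part of the OpenCL implementation discussed in this series). The two nested loops of N iterations each, with 4 multiplications per inner iteration, are exactly where the O(N^2) behaviour comes from.

#include <math.h>
#include <stddef.h>

/* Direct O(N^2) DFT of a complex sequence stored as separate
 * real/imaginary arrays, following the decomposition above. */
void dft_direct(const float *x_re, const float *x_im,
                float *X_re, float *X_im, size_t N)
{
    const float TWO_PI = 6.283185482025146484375f;

    for (size_t k = 0; k < N; ++k)          /* N output values...         */
    {
        float sum_re = 0.0f;
        float sum_im = 0.0f;

        for (size_t n = 0; n < N; ++n)      /* ...each needing N products */
        {
            float phi = -TWO_PI * (float)(k * n) / (float)N;

            sum_re += x_re[n] * cosf(phi) - x_im[n] * sinf(phi);
            sum_im += x_re[n] * sinf(phi) + x_im[n] * cosf(phi);
        }

        X_re[k] = sum_re;
        X_im[k] = sum_im;
    }
}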

 

Fast Fourier Transform - Cooley–Tukey algorithm

 

FFT is an efficient algorithm for producing exactly the same result (in the limit of precision) as DFT but with a complexity of O(N log2 N).

This method (and the general idea of FFT) was popularized by a publication of J. W. Cooley and J. W. Tukey in 1965, but it was later discovered that those two authors had independently re-invented an algorithm known to Carl Friedrich Gauss around 1805 (and subsequently rediscovered several times in limited forms).

https://en.wikipedia.org/wiki/Fast_Fourier_transform

The most famous and commonly used FFT algorithm is radix-2 Cooley–Tukey. It works only when N is a power of 2 and is built on a "divide and conquer" ("divide et impera") strategy which recursively breaks down the DFT into many smaller DFTs of size 2, called radix-2.

 

Essentially, it is as if we see N as composed of log2(N) factors of 2 (i.e. N = 2 x 2 x ...)

radix2_basic_element.png

radix2_n_8_2.png

Looking at the picture we have:

  • log2(N) radix-2 stages, each one composed of N/2 radix-2 basic elements
  • (log2(N) - 1) twiddle factor multiplication stages

 

Please note that after the first radix stage, the inputs of each radix basic element will not be in consecutive order. In particular there will be an offset called "span" between each input which will depend on the radix stage.

single_radix_element.png

 

 

From radix-2 to mixed-radix

 

Often for signal processing applications we design our systems with radix-2 FFTs in mind. However, requiring N to be a power of 2 so that it exactly fits radix-2 FFTs can cause a significant performance drop in some applications.

For instance, if we consider the worst case for a 1D FFT, forcing the use of radix-2 would require padding the input to double its size and consequently doubling the amount of computation. Moreover, this loss of performance is much worse in the 2D case: in image processing applications the dimensions are quite often not powers of 2, and doubling the size of both rows and columns increases the total FFT size by a factor of 4.

 

So far we have factorized N as N = 2 x 2 x 2 x ..., but any factorization is generally possible.

Factorizing N as N = N1 x N2 x N3 x ..., the "divide and conquer" strategy remains the same, but the algorithm now recursively breaks the DFT into many smaller DFTs of sizes N1, N2, N3, called respectively radix-N1, radix-N2, radix-N3, ...

 

The generalization of the basic radix-2 FFT is called mixed-radix.

 

In terms of pipeline stages we are going to have:

  • N / Ns DFTs of length Ns for each radix stage, where Ns is the radix order
  • Z radix stages, with Z equal to the number of factors used to compose the original length N.

7x5x3.png

 

From this first analysis we can see that factorization is the fundamental principle behind the Cooley–Tukey algorithm and that the radix-2 algorithm is just a special case of mixed-radix.

In principle mixed-radix allows us to compute the FFT for any input length N but, as typical implementations use a limited number of optimized radixes such as radix-2, radix-3, radix-5 and radix-7, the computation is restricted to those N that can be expressed as a product of the available factors. In our case we are going to use:

  • radix-2
  • radix-3
  • radix-4
  • radix-5
  • radix-7

 

so N must be a product of powers of 2, 3, 5 and 7.

compound.png

 

Please note: 4 does not appear in the factorization of N as it is itself a power of 2. The motivation behind the choice to implement a highly optimized radix-4 will become clear in the second article, where we will compare the performance of the radix-2 and mixed-radix algorithms.
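As an illustration of this constraint, here is a small host-side C sketch (a hypothetical helper, not taken from the actual implementation) that decomposes N into the supported radix orders, pulling out factors of 4 first so that the optimized radix-4 is used wherever possible:

/* Decompose N into the supported radix orders {4, 2, 3, 5, 7}.
 * Returns the number of stages written into 'stages', or 0 if N
 * contains a prime factor other than 2, 3, 5 or 7 (or if there is
 * not enough room in 'stages'). Illustrative sketch only. */
unsigned int factorize_radix_stages(unsigned int N, unsigned int *stages, unsigned int max_stages)
{
    static const unsigned int supported[] = { 4, 2, 3, 5, 7 };
    unsigned int n_stages = 0;

    for (unsigned int i = 0; i < sizeof(supported) / sizeof(supported[0]); ++i)
    {
        while ((N % supported[i]) == 0 && n_stages < max_stages)
        {
            stages[n_stages++] = supported[i];
            N /= supported[i];
        }
    }

    return (N == 1) ? n_stages : 0;
}

For example, N = 105 = 7 x 5 x 3 would be decomposed into the stages {3, 5, 7}, while a prime length such as N = 11 would be rejected.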

 

Theory behind mixed-radix

 

This section will detail the maths behind the Cooley–Tukey mixed-radix algorithm and will explain concepts such as twiddle factor multiplication and digit-reverse.

Since factorization is the key principle of the Cooley–Tukey mixed-radix algorithm, let's start by decomposing N into 2 factors, Nx and Ny:

nx_ny.png

Let's arrange the input data x(n) in a matrix with Nx columns and Ny rows where n-th element can be addressed in the following manner:

relation_n.png

with:

  • nx = n % Nx, scans the columns - nx ∈ [0, Nx - 1]
  • ny = floor(n / Nx), scans the rows - ny ∈ [0, N / Nx - 1]

i.e.

If we had N = 21 and we factorized N as N = 7 x 3, the 16-th element of the input array would be addressed:

  • ny = floor(16 / 7) = 2
  • nx = 16 % 7 = 2

 

address of 16-th element = 2 + 2 * 7

matrix_input.png

Using this notation, the DFT can be written as follows:

dft_nx_ny.png

Now let's place the output elements X(k) in a matrix as well but with Nx rows and Ny columns. The motivation behind this choice will be clear during the demonstration.

The k-th element will be addressed with the following relation:

relation_k.png

For addressing the output elements, we use kx and ky:

  • kx = floor(k / Ny), scans the rows - kx ∈ [0, N / Ny - 1]
  • ky = (k % Ny), scans the columns - ky ∈ [0, Ny - 1]

 

Note that the addressing of the output elements is flipped with respect to the input: ky scans the columns whilst kx scans the rows.

matrix_output.png

and the DFT becomes:

dft_nx_ny_2.png

Let's evaluate the complex exponential:

exponential.png

  • Term 1: Always 1, because it is a multiple of 2π
  • Term 4: A trigonometric complex constant (the twiddle factor)

 

The exponential is easy to study thanks to our choice of an output matrix with Nx rows and Ny columns.

 

Writing the 4th term as:

twiddle_factor_pre_stage.png

the final DFT's expression becomes:

final_expression_mixed_radix.png

This is the expression behind the computation of mixed-radix algorithm in the case of 2 factors.

 

If we had for instance N = 21 and we factorized N as N = 7 x 3, we could graphically summarize this final expression in the following manner:

transpose_fft_steps.png

This can be translated into the following pipeline:

twiddle_factor_multiplication_2stages_mult2.png

 

Summary

 

  1. If Nx and Ny were composite numbers, we could again apply the decomposition to Nx and Ny
  2. The outer sum in the final expression is the DFT of each row after first multiplying by a proper trigonometric complex constant called "twiddle factor"
  3. The final result does not change by swapping the order of the factors
  4. Looking at the picture we can easily guess that each stage of the pipeline is an embarrassingly parallel problem.

 

In the end from the demonstration we can highlight the 3 main computation blocks behind this algorithm:

  1. Digit-reverse stage in order to have the output complex values in the correct order
  2. Twiddle factor multiplication stage which multiplies each value by a proper complex constant before transmitting to the next radix stage
  3. Radix stage which computes a highly optimized DFT

 

With this first article we have covered the minimal background behind the mixed-radix FFT algorithm and we are ready to dive into the next article, where we are going to play with the OpenCL kernels.

 

Ciao,

Gian Marco

 

Read Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 2

Read Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 3

 

References

 

  1. https://community.arm.com/groups/arm-mali-graphics/blog/2014/05/30/fast-fourier-transformation-vectorization-on-arm-mali-gpus
  2. http://www.engineeringproductivitytools.com/stuff/T0001/PT07.HTM
  3. Fast Fourier Transform - Algorithms and Applications, K. R. Rao, Do Nyeon Kim, Jae Jeong Hwang
  4. The Scientist and Engineer's Guide to Digital Signal Processing, Steven W. Smith

Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 2

$
0
0

Back to Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 1

Read Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 3

 

Here we are for the second part of our blog series about the OpenCL™ implementation of the Complex to Complex Fast Fourier Transform based on the mixed-radix method on Mobile ARM® Mali™ GPUs.

Whilst in the first article - Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 1 - we presented the basic mathematical background, in the following we are going to play with the 3 main computation blocks behind the mixed-radix FFT, analyzing each one step by step from the point of view of both the theory and its efficient and simple implementation in OpenCL.

The development platform used to carry out our performance analysis will be the Firefly-RK3288 which is an all-in-one high performance multi-core computing platform based on the Rockchip RK3288 SoC featuring an ARM® Cortex®-A17 @1.8GHz and ARM® Mali™-T760 (Mali GPU Driver: r6p0-02rel0).

For all performance evaluations, DVFS has been disabled and the GPU clock has been set at the maximum frequency of 600MHz.

Further information about the development platform can be found at:

 

pararm-fire.png

 

Implementation

 

Let's take a look at how we store our complex values. The input and output data are stored in a floating point complex array with 2*N elements where the real and imaginary parts are placed alternately. Two different buffers with 2*N elements have been used to store the input and output values.

 

array.png

 

The FFT mixed-radix pipeline that we are going to study is composed mainly of 3 computation blocks:

  1. Digit-reverse
  2. Twiddle factor multiplication
  3. Radix computation

pipeline_separated.png

 

Indexing scheme

 

The only real complexity of FFT mixed-radix is the indexing scheme. The indexing scheme expresses the relation between the indices n and k of 2 generic stages X and Y. This relation is fundamental because it allows us both to know which inputs must be processed by each radix basic element and to correctly perform the digit-reverse and twiddle factor multiplication stages.

 

Given 2 generic stages X and Y:

  1. n addresses the values in the stage X: n = nx + ny * Nx
  2. k addresses the values in the stage Y: k = ky + kx * Ny

with:

  • nx = n % Nx, scans the columns - nx ∈ [0, Nx - 1]
  • ny = floor(n / Nx), scans the rows - ny ∈ [0, N / Nx - 1]
  • kx = floor(k / Ny), scans the rows - kx ∈ [0, N / Ny - 1]
  • ky = (k % Ny), scans the columns - ky ∈ [0, Ny - 1]

n_k.png

From the above picture we can easily see what nx, ny, kx and ky are. As for Nx and Ny: in the case of just 2 radix stages they are respectively the radix orders of stages X and Y. In the case of more than 2 radix stages, Ny is still the radix order of stage Y, but Nx becomes the product of the radix orders from the first radix stage up to stage X.

 

Nx is also known as the span or butterfly span. The span is the offset between the complex inputs of a radix basic element and it was introduced in the first article in reference to the radix-2 FFT algorithm.

nx_radix_s.png

i.e.

Let's assume we have N = 8 x 4 x 4 x 3 x 2 (5 stages).

If the stages X and Y were stage 2 and stage 3 (counting the stages from 0), Nx would be Nx = 8 x 4 x 4 = 128 and Ny = 3

 

With a pipeline made up of M radix stages, we would like to have a relation between the indices k and n of 2 generic stages; in particular we would like to have something like:

 

map_kn.png

mapping_function_nk.png

 

The general expressions for computing these mapping functions are:

relation_nk.png

n_to_k_relation.png

where Ni is:

ni.png

 

/**
 * @brief Map index k to index n
 *
 * @param[in] k  Index k
 * @param[in] Nx It is the span
 * @param[in] Ny It is the radix order of stage Y
 *
 * @return       Index n
 */
uint map_k_to_n(uint k, uint Nx, uint Ny)
{
    uint Ni = Nx * Ny;
    uint ky = k % Ny;    // Remainder
    uint kx = k / Ny;    // Integer part
    uint n = (kx % Nx) + (kx / Nx) * Ni + ky * Nx;

    return n;
}

 

/**
 * @brief Map index n to index k
 *
 * @param[in] n  Index n
 * @param[in] Nx It is the span
 * @param[in] Ny It is the radix order of stage Y
 *
 * @return       Index k
 */
uint map_n_to_k(uint n, uint Nx, uint Ny)
{
    uint Ni = Nx * Ny;
    uint k = (n * Ny) % Ni + (n / Nx) % Ny + Ni * (n / Ni);

    return k;
}
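As a quick host-side sanity check (illustrative only, assuming the two mapping functions above are in the same translation unit together with a typedef unsigned int uint;), we can verify that map_n_to_k() and map_k_to_n() are inverses of each other for the N = 21 = 7 x 3 example used in the first article:

#include <stdio.h>

int main(void)
{
    const uint N  = 21;
    const uint Nx = 7;   /* span: radix products up to stage X */
    const uint Ny = 3;   /* radix order of stage Y             */

    for (uint n = 0; n < N; ++n)
    {
        uint k = map_n_to_k(n, Nx, Ny);

        /* Mapping n -> k and back must return the original n */
        printf("n = %2u -> k = %2u -> n = %2u\n", n, k, map_k_to_n(k, Nx, Ny));
    }

    return 0;
}

For instance, the 16-th element maps to k = 8 and back to n = 16, consistent with the addressing example shown earlier.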

 

Every time we compute a radix stage, it is good to keep the span Nx updated.

 

/* Init Nx to 1 */
uint Nx = 1;

/* Scan each radix stage */
for(uint s = 0; s < n_stages; ++s)
{
    /* Get radix order of stage s */
    uint Ny = radix[s];

    /* Body for computing twiddle factor multiplication and radix computation */
    ...

    /* Update Nx */
    Nx *= Ny;
}

 

 

Digit-reverse

 

Once we have introduced the indexing scheme, we are ready to analyze the main computation blocks of our pipeline. Let's start with the digit-reverse.

The digit-reverse is the first stage of our pipeline which places the input elements in a specific order called "digit-reverse order".

The digit-reverse order would be exactly the order of the output elements if we left the input elements in linear order (0, 1, 2,...).

 

Since we know the relation between the indices n and k, a possible way of obtaining the digit-reverse order is to iterate the mapping function map_n_to_k() from the first to the last stage in order to know where the input index n would be mapped at the end.

 

digit_reverse.png

/**
 * @brief This function computes the digit reverse index for each input
 *
 * @param[in]  stage             It contains the radix order for each radix stage
 * @param[out] idx_digit_reverse It contains the digit-reverse order index
 * @param[in]  n_stages          Total number of radix stages
 * @param[in]  N                 Number of input
 */
void digit_reverse(float* stage, uint* idx_digit_reverse, uint n_stages, uint N)
{
    /* Scan elements */
    for(uint n = 0; n < N; ++n)
    {
        uint k = n;
        uint Nx = stage[0];

        /* Scan stages */
        for (uint s = 1; s < n_stages; ++s)
        {
            /* radix of stage s-th */
            uint Ny = stage[s];
            uint Ni = Ny * Nx;

            /* Update k index */
            k = (k * Ny) % Ni + (k / Nx) % Ny + Ni * (k / Ni);

            /* Update Nx */
            Nx *= Ny;
        }

        /* K is the index of digit-reverse */
        idx_digit_reverse[n] = k;
    }
}
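A small usage example (again illustrative only, assuming the function above and a uint typedef are in scope): for N = 6 factorized as 3 x 2, the computed digit-reverse indices come out as 0, 2, 4, 1, 3, 5.

#include <stdio.h>

int main(void)
{
    float stage[2] = { 3.0f, 2.0f };  /* radix orders: 3 then 2 */
    uint  idx[6];

    digit_reverse(stage, idx, 2, 6);

    for (uint n = 0; n < 6; ++n)
    {
        printf("n = %u -> digit-reverse index = %u\n", n, idx[n]);
    }

    return 0;
}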

 

Once we know the digit-reverse index, we can implement a CL kernel that places the input values in the required order.

 

/**
 * @brief This kernel stores the input in digit-reverse order
 *
 * @param[in]  input            It contains the input complex values in linear order
 * @param[out] output           It contains the output complex values in digit-reverse order
 * @param[in]  idx_bit_reverse  It contains the digit-reverse order index
 */
kernel void digit_reverse(global float2* input, global float2* output, global uint* idx_digit_reverse)
{
    /* Each work-item stores a single complex value */
    const uint n = get_global_id(0);

    /* Get digit-reverse index */
    const uint idx = idx_digit_reverse[n];

    /* Get complex value */
    float2 val = (float2)(input[idx]);

    /* Store complex value */
    output[n] = val;
}

 

 

The digit-reverse kernel uses 2 different cl_buffers:

  1. one for loading the input complex values
  2. one for storing the output complex values in digit-reverse order.

 

Twiddle Factor Multiplication

 

Twiddle factor multiplication is the intermediate stage between 2 generic radix stages X and Y. In particular it multiplies all the N output complex elements of radix stage X by a specific trigonometric complex constant before passing those values to the next stage Y.

 

This trigonometric complex constant, the "twiddle" factor, depends on nx and ky:
twiddle_factor_pre_stage.png

The twiddle factor multiplication is also highly parallelizable and can be implemented in OpenCL in the following manner:

 

#define M_2PI_F 6.283185482025146484375f

#define TWIDDLE_FACTOR_MULTIPLICATION(phi, input)                 \
{                                                                 \
    float2 w, tmp;                                                \
    w.x = cos(phi);                                               \
    w.y = sin(phi);                                               \
    tmp.x = (w.x * input.x) - (w.y * input.y);                    \
    tmp.y = (w.x * input.y) + (w.y * input.x);                    \
    input = tmp;                                                  \
}

/**
 * @brief This kernel computes the twiddle factor multiplication between 2 generic radix stages X and Y
 *
 * @param[in, out] input It contains the input and output complex values
 * @param[in]      Nx    It is the span
 * @param[in]      Ny    It is the radix order of stage Y
 * @param[in]      Ni    Nx * Ny
 */
kernel void twiddle_factor_multiplication(global float2* input, const uint Nx, const uint Ny, const uint Ni)
{
    /* Each work-item computes a single complex input */
    const uint n = get_global_id(0);

    /* Compute nx */
    uint nx = n % Nx;

    /* Compute k index */
    uint k = (n * Ny) % Ni + (n / Nx) % Ny + Ni * (n / Ni);

    /* Compute ky */
    uint ky = k % Ny;

    /* Compute angle of twiddle factor */
    float phi = (-M_2PI_F * nx * ky) / (float)Ni;

    /* Multiply by twiddle factor */
    TWIDDLE_FACTOR_MULTIPLICATION(phi, input[n]);
}

 

The above kernel uses the same cl_buffer for loading and storing.

 

An interesting optimization concerns the computation of ky. As we know,

ky_computation.png

so:

ky_expansion.png

since in the above expression A and B are multiples of Ny, the terms 1 and 3 will be 0, therefore:

ky_reduced.png

but for the identity property of the modulo operator we have:

modulo_identity.png

and then we can compute ky as:

ky_final.png
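Written out explicitly, the derivation sketched by the images above is:

k_y = k \bmod N_y = \big[\, (nN_y) \bmod N_i \;+\; (\lfloor n/N_x \rfloor \bmod N_y) \;+\; N_i \lfloor n/N_i \rfloor \,\big] \bmod N_y

and since (nN_y) \bmod N_i and N_i \lfloor n/N_i \rfloor are both multiples of N_y (recall N_i = N_x N_y), they vanish modulo N_y, leaving:

k_y = (\lfloor n/N_x \rfloor \bmod N_y) \bmod N_y = \lfloor n/N_x \rfloor \bmod N_y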

 

At this point we can simplify our OpenCL kernel in the following manner:

 

kernel void twiddle_factor_multiplication(global float2* input, const uint Nx, const uint Ny, const uint Ni)
{
    /* Each work-item computes a single complex input */
    const uint n = get_global_id(0);

    /* Compute nx */
    uint nx = n % Nx;

    /* Compute ky */
    uint ky = (n / Nx) % Ny;               // <-

    /* Compute angle of twiddle factor */
    float phi = (-M_2PI_F * nx * ky) / (float)Ni;

    /* Multiply by twiddle factor */
    TWIDDLE_FACTOR_MULTIPLICATION(phi, input[n]);
}

 

Another important and simple optimization that we can introduce concerns the computation of the angle for the twiddle factor multiplication. As we can observe, the term (-M_2PI_F / (float)Ni) is a constant for all work-items and can therefore be passed as an argument to the CL kernel.

 

/**
 * @brief This kernel computes the twiddle factor multiplication between 2 generic radix stages X and Y
 *
 * @param[in, out] input     It contains the input and output complex values
 * @param[in]      Nx        It is the span
 * @param[in]      Ny        It is the radix order of stage Y
 * @param[in]      Ni        Nx * Ny
 * @param[in]      exp_const (-M_2PI_F / (float)Ni)
 */
kernel void twiddle_factor_multiplication(global float2* input, const uint Nx, const uint Ny, const uint Ni, const float exp_const)
{
    /* Each work-item computes a single complex input */
    const uint n = get_global_id(0);

    /* Compute nx */
    uint nx = n % Nx;

    /* Compute ky */
    uint ky = (n / Nx) % Ny;

    /* Compute angle of twiddle factor */
    float phi = (float)(nx * ky) * exp_const;                    // <-

    /* Multiply by twiddle factor */
    TWIDDLE_FACTOR_MULTIPLICATION(phi, input[n]);
}

 

With these 2 simple optimizations, we are able to improve the performance of this kernel by a factor of ~1.6-1.7x as illustrated in the following graph.

 

twiddle_factor_optimized.png

 

Radix computation

 

The radix stage is the main computation block of the pipeline, composed of N / S radix-S basic elements.

Each radix-S consists of S inputs and S outputs and computes an optimized DFT of length S.

radix_s.png

Since there are no dependencies between the radix basic elements of the same stage, the radix stage is also an embarrassingly parallel problem.

 

7x5x3.png

 

In our implementation each radix-S basic element of a radix stage is performed by a single work-item. Since in the first stage the radix basic element takes its input in linear order, whilst in the following stages the input is addressed according to the span Nx, we are going to use 2 different kernels for the radix basic element:

  1. One kernel for the first stage which takes the input in linear order
  2. One kernel for the other radix stages

 

Both kernels use a single cl_buffer for loading and storing.
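Before looking at the individual radix kernels, the sketch below shows one possible way to chain the digit-reverse, twiddle factor and radix kernels from the host side. This is only a minimal illustration under several assumptions: the kernels have already been built and the buffers created and filled elsewhere, every stage is assumed to use the same radix order (a real implementation would select the appropriate radix kernel for each stage), and error checking is omitted.

#include <CL/cl.h>

/* Hypothetical helper (not from the actual implementation): enqueue the
 * whole mixed-radix pipeline for the simplified case in which every stage
 * uses the same radix order. */
static void enqueue_fft_pipeline(cl_command_queue queue,
                                 cl_kernel k_digit_reverse, /* digit_reverse                    */
                                 cl_kernel k_radix_first,   /* radix_S_first_stage              */
                                 cl_kernel k_radix,         /* radix_S (stages after the first) */
                                 cl_kernel k_twiddle,       /* twiddle_factor_multiplication    */
                                 cl_mem buf_in,             /* input complex values             */
                                 cl_mem buf_work,           /* digit-reversed, in-place buffer  */
                                 cl_mem buf_idx,            /* digit-reverse indices            */
                                 const cl_uint *radix,      /* radix order of each stage        */
                                 cl_uint n_stages,
                                 cl_uint N)
{
    size_t gws;

    /* Digit-reverse: one work-item per complex value */
    clSetKernelArg(k_digit_reverse, 0, sizeof(cl_mem), &buf_in);
    clSetKernelArg(k_digit_reverse, 1, sizeof(cl_mem), &buf_work);
    clSetKernelArg(k_digit_reverse, 2, sizeof(cl_mem), &buf_idx);
    gws = N;
    clEnqueueNDRangeKernel(queue, k_digit_reverse, 1, NULL, &gws, NULL, 0, NULL, NULL);

    cl_uint Nx = 1;
    for (cl_uint s = 0; s < n_stages; ++s)
    {
        cl_uint Ny = radix[s];
        cl_uint Ni = Nx * Ny;

        if (s > 0)
        {
            /* Twiddle factor multiplication between stage s-1 and stage s:
             * one work-item per complex value */
            float exp_const = -6.283185482025146484375f / (float)Ni;
            clSetKernelArg(k_twiddle, 0, sizeof(cl_mem),  &buf_work);
            clSetKernelArg(k_twiddle, 1, sizeof(cl_uint), &Nx);
            clSetKernelArg(k_twiddle, 2, sizeof(cl_uint), &Ny);
            clSetKernelArg(k_twiddle, 3, sizeof(cl_uint), &Ni);
            clSetKernelArg(k_twiddle, 4, sizeof(float),   &exp_const);
            gws = N;
            clEnqueueNDRangeKernel(queue, k_twiddle, 1, NULL, &gws, NULL, 0, NULL, NULL);
        }

        /* Radix stage: one work-item per radix-Ny basic element */
        gws = N / Ny;
        if (s == 0)
        {
            clSetKernelArg(k_radix_first, 0, sizeof(cl_mem), &buf_work);
            clEnqueueNDRangeKernel(queue, k_radix_first, 1, NULL, &gws, NULL, 0, NULL, NULL);
        }
        else
        {
            clSetKernelArg(k_radix, 0, sizeof(cl_mem),  &buf_work);
            clSetKernelArg(k_radix, 1, sizeof(cl_uint), &Nx);
            clSetKernelArg(k_radix, 2, sizeof(cl_uint), &Ni);
            clEnqueueNDRangeKernel(queue, k_radix, 1, NULL, &gws, NULL, 0, NULL, NULL);
        }

        Nx *= Ny;
    }
}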

 

In the following we are going to describe the implementation of radix 2/3/4/5/7.

 

Radix-2

 

Defined as:

radix_2.png

can be expressed in the following matrix form:

matrix_form_radix_2.png
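For reference, the 2-point DFT matrix shown in the image above is simply:

\begin{bmatrix} X(0) \\ X(1) \end{bmatrix} =
\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
\begin{bmatrix} x(0) \\ x(1) \end{bmatrix}

which is exactly what the DFT_2 macro below computes: a sum and a difference.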

 

#define DFT_2(c0, c1) \
{                     \
    float2 v0;        \
    v0 = c0;          \
    c0 = v0 + c1;     \
    c1 = v0 - c1;     \
}

/**
 * @brief This kernel computes DFT of size 2 for the first stage
 *
 * @param[in, out] input It contains the input and output complex values
 */
kernel void radix_2_first_stage(global float* input)
{
    /* Each work-item computes a single radix-2 */
    uint idx = get_global_id(0) * 4;

    /* Load two complex input values */
    float4 in = vload4(0, input + idx);

    /* Compute DFT N = 2 */
    DFT_2(in.s01, in.s23);

    /* Store two complex output values */
    vstore4(in, 0, input + idx);
}

/**
 * @brief This kernel computes DFT of size 2 for the radix stages after the first
 *
 * @param[in, out] input It contains the input and output complex value
 * @param[in]      Nx    It is the span
 * @param[in]      Ni    Nx * Ny
 */
kernel void radix_2(global float2* input, uint Nx, uint Ni)
{
    /* Each work-item computes a single radix-2 */
    uint kx = get_global_id(0);

    /* Compute n index */
    uint n = (kx % Nx) + (kx / Nx) * Ni;

    /* Load two complex input values */
    float2 c0 = input[n];
    float2 c1 = input[n + Nx];

    /* Compute DFT N = 2 */
    DFT_2(c0, c1);

    /* Store two complex output values */
    input[n] = c0;
    input[n + Nx] = c1;
}

 

Radix-3

 

Defined as:

radix_3.png

can be expressed in the following matrix form:

matrix_form_radix_3.png

 

#define SQRT3DIV2       0.86602540378443f

#define DFT_3(c0, c1, c2)                          \
{                                                  \
    float2 v0 = c1 + c2;                           \
    float2 v1 = c1 - c2;                           \
    c1.x = c0.x - 0.5f * v0.x + v1.y * SQRT3DIV2;  \
    c1.y = c0.y - 0.5f * v0.y - v1.x * SQRT3DIV2;  \
    c2.x = c0.x - 0.5f * v0.x - v1.y * SQRT3DIV2;  \
    c2.y = c0.y - 0.5f * v0.y + v1.x * SQRT3DIV2;  \
    c0 = c0 + v0;                                  \
}

/**
 * @brief This kernel computes DFT of size 3 for the first stage
 *
 * @param[in, out] input It contains the input and output complex values
 */
kernel void radix_3_first_stage(global float* input)
{
    /* Each work-item computes a single radix-3 */
    uint idx = get_global_id(0) * 6;

    /* Load three complex input values */
    float4 in0 = vload4(0, input + idx);
    float2 in1 = vload2(0, input + idx + 4);

    /* Compute DFT N = 3 */
    DFT_3(in0.s01, in0.s23, in1.s01);

    /* Store three complex output values */
    vstore4(in0, 0, input + idx);
    vstore2(in1, 0, input + idx + 4);
}

/**
 * @brief This kernel computes DFT of size 3 for the radix stages after the first
 *
 * @param[in, out] input It contains the input and output complex value
 * @param[in]      Nx    It is the span
 * @param[in]      Ni    Nx * Ny
 */
kernel void radix_3(global float2* input, uint Nx, uint Ni)
{
    /* Each work-item computes a single radix-3 */
    uint kx = get_global_id(0);

    /* Compute n index */
    uint n = (kx % Nx) + (kx / Nx) * Ni;

    /* Load three complex input values */
    float2 c0 = input[n];
    float2 c1 = input[n + Nx];
    float2 c2 = input[n + 2 * Nx];

    /* Compute DFT N = 3 */
    DFT_3(c0, c1, c2);

    /* Store three complex output values */
    input[n] = c0;
    input[n + Nx] = c1;
    input[n + 2 * Nx] = c2;
}

 

Radix-4

 

Defined as:

radix_4.png

can be expressed in the following matrix form:

matrix_form_radix_4.png
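For reference, the 4-point DFT matrix shown above is (with W_4 = e^{-j\pi/2} = -j):

\begin{bmatrix} X(0) \\ X(1) \\ X(2) \\ X(3) \end{bmatrix} =
\begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & -j & -1 & j \\
1 & -1 & 1 & -1 \\
1 & j & -1 & -j
\end{bmatrix}
\begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \end{bmatrix}

so no true multiplications are needed: the DFT_4 macro below implements the ±1 and ±j combinations using only additions, subtractions and component swaps, which is part of what makes a dedicated radix-4 so cheap.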

 

#define DFT_4(c0, c1, c2, c3) \
{                             \
    float2 v0, v1, v2, v3;    \
    v0 = c0 + c2;             \
    v1 = c1 + c3;             \
    v2 = c0 - c2;             \
    v3.x = c1.y - c3.y;       \
    v3.y = c3.x - c1.x;       \
    c0 = v0 + v1;             \
    c2 = v0 - v1;             \
    c1 = v2 + v3;             \
    c3 = v2 - v3;             \
}

/**
 * @brief This kernel computes DFT of size 4 for the first stage
 *
 * @param[in, out] input It contains the input and output complex values
 */
kernel void radix_4_first_stage(global float* input)
{
    /* Each work-item computes a single radix-4 */
    uint idx = get_global_id(0) * 8;

    /* Load four complex input values */
    float8 in = vload8(0, input + idx);

    /* Compute DFT N = 4 */
    DFT_4(in.s01, in.s23, in.s45, in.s67);

    /* Store four complex output values */
    vstore8(in, 0, input + idx);
}

/**
 * @brief This kernel computes DFT of size 4 for the radix stages after the first
 *
 * @param[in, out] input It contains the input and output complex value
 * @param[in]      Nx    It is the span
 * @param[in]      Ni    Nx * Ny
 */
kernel void radix_4(global float2* input, uint Nx, uint Ni)
{
    /* Each work-item computes a single radix-4 */
    uint kx = get_global_id(0);

    /* Compute n index */
    uint n = (kx % Nx) + (kx / Nx) * Ni;

    /* Load four complex input values */
    float2 c0 = input[n];
    float2 c1 = input[n + Nx];
    float2 c2 = input[n + 2 * Nx];
    float2 c3 = input[n + 3 * Nx];

    /* Compute DFT N = 4 */
    DFT_4(c0, c1, c2, c3);

    /* Store four complex output values */
    input[n] = c0;
    input[n + Nx] = c1;
    input[n + 2 * Nx] = c2;
    input[n + 3 * Nx] = c3;
}

 

 

Radix-5

 

Defined as:

radix_5.png

can be expressed in the following matrix form:

matrix_form_radix_5.png

 

#define W5_A    0.30901699437494f
#define W5_B    0.95105651629515f
#define W5_C    0.80901699437494f
#define W5_D    0.58778525229247f

#define DFT_5(c0, c1, c2, c3, c4)               \
{                                               \
    float2 v0, v1, v2, v3, v4;                  \
    v0 = c0;                                    \
    v1 = W5_A * (c1 + c4) - W5_C * (c2 + c3);   \
    v2 = W5_C * (c1 + c4) - W5_A * (c2 + c3);   \
    v3 = W5_D * (c1 - c4) - W5_B * (c2 - c3);   \
    v4 = W5_B * (c1 - c4) + W5_D * (c2 - c3);   \
    c0 = v0 + c1 + c2 + c3 + c4;                \
    c1 = v0 + v1 + (float2)(v4.y, -v4.x);       \
    c2 = v0 - v2 + (float2)(v3.y, -v3.x);       \
    c3 = v0 - v2 + (float2)(-v3.y, v3.x);       \
    c4 = v0 + v1 + (float2)(-v4.y, v4.x);       \
}

/**
 * @brief This kernel computes DFT of size 5 for the first stage
 *
 * @param[in, out] input It contains the input and output complex values
 */
kernel void radix_5_first_stage(global float* input)
{
    /* Each work-item computes a single radix-5 */
    uint idx = get_global_id(0) * 10;

    /* Load five complex input values */
    float8 in0 = vload8(0, input + idx);
    float2 in1 = vload2(0, input + idx + 8);

    /* Compute DFT N = 5 */
    DFT_5(in0.s01, in0.s23, in0.s45, in0.s67, in1.s01);

    /* Store five complex output values */
    vstore8(in0, 0, input + idx);
    vstore2(in1, 0, input + idx + 8);
}

/**
 * @brief This kernel computes DFT of size 5 for the radix stages after the first
 *
 * @param[in, out] input It contains the input and output complex value
 * @param[in]      Nx    It is the span
 * @param[in]      Ni    Nx * Ny
 */
kernel void radix_5(global float2* input, uint Nx, uint Ni)
{
    /* Each work-item computes a single radix-5 */
    uint kx = get_global_id(0);

    /* Compute n index */
    uint n = (kx % Nx) + (kx / Nx) * Ni;

    /* Load five complex input values */
    float2 c0 = input[n];
    float2 c1 = input[n + Nx];
    float2 c2 = input[n + 2 * Nx];
    float2 c3 = input[n + 3 * Nx];
    float2 c4 = input[n + 4 * Nx];

    /* Compute DFT N = 5 */
    DFT_5(c0, c1, c2, c3, c4);

    /* Store five complex output values */
    input[n] = c0;
    input[n + Nx] = c1;
    input[n + 2 * Nx] = c2;
    input[n + 3 * Nx] = c3;
    input[n + 4 * Nx] = c4;
}

 

Radix-7

 

Defined as:

radix_7.png

can be expressed in the following matrix form:

matrix_form_radix_7.png

 

#define W7_A    0.62348980185873f
#define W7_B    0.78183148246802f
#define W7_C    0.22252093395631f
#define W7_D    0.97492791218182f
#define W7_E    0.90096886790241f
#define W7_F    0.43388373911755f

#define DFT_7(c0, c1, c2, c3, c4, c5, c6)                           \
{                                                                   \
    float2 v0, v1, v2, v3, v4, v5, v6;                              \
    v0 = c0;                                                        \
    v1 = W7_A * (c1 + c6) - W7_C * (c2 + c5) - W7_E * (c3 + c4);    \
    v2 = W7_C * (c1 + c6) + W7_E * (c2 + c5) - W7_A * (c3 + c4);    \
    v3 = W7_E * (c1 + c6) - W7_A * (c2 + c5) + W7_C * (c3 + c4);    \
    v4 = W7_B * (c1 - c6) + W7_D * (c2 - c5) + W7_F * (c3 - c4);    \
    v5 = W7_D * (c1 - c6) - W7_F * (c2 - c5) - W7_B * (c3 - c4);    \
    v6 = W7_F * (c1 - c6) - W7_B * (c2 - c5) + W7_D * (c3 - c4);    \
    c0 = v0 + c1 + c2 + c3 + c4 + c5 + c6;                          \
    c1 = v0 + v1 + (float2)(v4.y, -v4.x);                           \
    c2 = v0 - v2 + (float2)(v5.y, -v5.x);                           \
    c3 = v0 - v3 + (float2)(v6.y, -v6.x);                           \
    c4 = v0 - v3 + (float2)(-v6.y, v6.x);                           \
    c5 = v0 - v2 + (float2)(-v5.y, v5.x);                           \
    c6 = v0 + v1 + (float2)(-v4.y, v4.x);                           \
}

/**
 * @brief This kernel computes DFT of size 7 for the first stage
 *
 * @param[in, out] input It contains the input and output complex values
 */
kernel void radix_7_first_stage(global float* input)
{
    /* Each work-item computes a single radix-7 */
    uint idx = get_global_id(0) * 14;

    /* Load seven complex input values */
    float8 in0 = vload8(0, input + idx);
    float4 in1 = vload4(0, input + idx + 8);
    float2 in2 = vload2(0, input + idx + 12);

    /* Compute DFT N = 7 */
    DFT_7(in0.s01, in0.s23, in0.s45, in0.s67, in1.s01, in1.s23, in2.s01);

    /* Store seven complex output values */
    vstore8(in0, 0, input + idx);
    vstore4(in1, 0, input + idx + 8);
    vstore2(in2, 0, input + idx + 12);
}

/**
 * @brief This kernel computes DFT of size 7 for the radix stages after the first
 *
 * @param[in, out] input It contains the input and output complex value
 * @param[in]      Nx    It is the span
 * @param[in]      Ni    Nx * Ny
 */
kernel void radix_7(global float2* input, uint Nx, uint Ni)
{
    /* Each work-item computes a single radix-7 */
    uint kx = get_global_id(0);

    /* Compute n index */
    uint n = (kx % Nx) + (kx / Nx) * Ni;

    /* Load seven complex input values */
    float2 c0 = input[n];
    float2 c1 = input[n + Nx];
    float2 c2 = input[n + 2 * Nx];
    float2 c3 = input[n + 3 * Nx];
    float2 c4 = input[n + 4 * Nx];
    float2 c5 = input[n + 5 * Nx];
    float2 c6 = input[n + 6 * Nx];

    /* Compute DFT N = 7 */
    DFT_7(c0, c1, c2, c3, c4, c5, c6);

    /* Store seven complex output values */
    input[n] = c0;
    input[n + Nx] = c1;
    input[n + 2 * Nx] = c2;
    input[n + 3 * Nx] = c3;
    input[n + 4 * Nx] = c4;
    input[n + 5 * Nx] = c5;
    input[n + 6 * Nx] = c6;
}

 

 

Merging twiddle factor multiplication with radix stage

 

Generally there are a couple of important benefits to combining 2 OpenCL kernels into one:

  1. Fewer GPU jobs to dispatch, which reduces possible driver overhead. In our implementation this saves (log2(N) - 1) GPU jobs.
  2. Fewer memory accesses, which is a great thing not only for performance but also for power consumption.

Regarding the second point: since both the twiddle factor multiplication and the radix computation use the same cl_buffer for loading and storing, keeping them in 2 separate kernels means accessing the same memory locations twice.

 

Combining these 2 kernels, the new pipeline becomes:

pipeline_merged.png

 

In order to compute the twiddle factor multiplication inside the radix kernel, we need just a small tweak as we can see, for instance, in the following radix-5 OpenCL kernel:

 

/**
 * @brief This kernel computes DFT of size 5 for the radix stages after the first
 *
 * @param[in, out] input     It contains the input and output complex value
 * @param[in]      Nx        It is the span
 * @param[in]      Ni        Nx * Ny
 * @param[in]      exp_const (-M_2PI_F / (float)Ni)
 */
kernel void radix_5(global float2* input, uint Nx, uint Ni, float exp_const)
{
    /* Each work-item computes a single radix-5 */
    uint kx = get_global_id(0);

    /* Compute nx */
    uint nx = kx % Nx;                                   // <-

    /* Compute n index */
    uint n = nx + (kx / Nx) * Ni;                        // <-

    /* Load five complex input values */
    float2 c0 = input[n];
    float2 c1 = input[n + Nx];
    float2 c2 = input[n + 2 * Nx];
    float2 c3 = input[n + 3 * Nx];
    float2 c4 = input[n + 4 * Nx];

    /* Compute phi */
    float phi = (float)nx * exp_const;                   // <- Please note: there is no ky

    /* Multiply by twiddle factor */
    TWIDDLE_FACTOR_MULTIPLICATION(phi, c1);
    TWIDDLE_FACTOR_MULTIPLICATION(2 * phi, c2);
    TWIDDLE_FACTOR_MULTIPLICATION(3 * phi, c3);
    TWIDDLE_FACTOR_MULTIPLICATION(4 * phi, c4);

    /* Compute DFT N = 5 */
    DFT_5(c0, c1, c2, c3, c4);

    /* Store five complex output values */
    input[n] = c0;
    input[n + Nx] = c1;
    input[n + 2 * Nx] = c2;
    input[n + 3 * Nx] = c3;
    input[n + 4 * Nx] = c4;
}
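 

For reference, the TWIDDLE_FACTOR_MULTIPLICATION macro used above was introduced in the first part of this series: it is essentially a complex multiplication by e^(i*phi). A minimal sketch along those lines (the actual macro may differ in detail, for instance by using the faster native_cos/native_sin built-ins) looks like this:

#define TWIDDLE_FACTOR_MULTIPLICATION(phi, input)       \
{                                                       \
    float2 w, tmp;                                      \
    w.x = cos(phi);                                     \
    w.y = sin(phi);                                     \
    /* Complex multiplication: input = w * input */     \
    tmp.x = (w.x * input.x) - (w.y * input.y);          \
    tmp.y = (w.x * input.y) + (w.y * input.x);          \
    input = tmp;                                        \
}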

 

The following graph shows that the speed-up is already considerable (~1.5x) for small N.

 

separated_vs_merged.png

 

Mixed-Radix vs Radix-2

 

We are approaching the end of this second part but before we finish, we'd like to present a final comparison between mixed-radix and radix-2.

In the first article we introduced mixed-radix as the solution for efficiently overcoming the problem of N not being a power of 2. However, if N is a power of 2, what is the benefit of using mixed-radix rather than radix-2?

As we know, radix-2 is just a special case of mixed-radix. Since our implementation also provides radix-4, two consecutive radix-2 stages can be merged into a single radix-4 stage, thus reducing:

  1. The number of radix stages to compute
  2. The number of memory accesses

The following graph shows the speed-up achievable by mixed-radix with radix-4 against a pure radix-2 implementation. As we can appreciate, the speed-up is already considerable for small N, reaching 1.7x better performance for N greater than 4096.

radix2_mixed.png
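 

To make the saving in radix stages concrete, here is a small illustrative sketch (not part of the implementation above) that counts the number of radix stages, and hence full passes over the data, for a given N with and without radix-4. For N = 4096 it returns 12 stages for radix-2 only and 6 stages when radix-4 is used:

/* Illustrative only: counts radix stages for a power-of-2 N */
unsigned int count_stages(unsigned int N, int use_radix_4)
{
    unsigned int stages = 0;

    while(N > 1)
    {
        if(use_radix_4 && (N % 4) == 0)
        {
            N /= 4;     /* one radix-4 stage replaces two radix-2 stages */
        }
        else if((N % 2) == 0)
        {
            N /= 2;     /* fall back to a radix-2 stage */
        }
        else
        {
            break;      /* N is not a power of 2; other radices would be needed */
        }
        ++stages;       /* each stage is one full pass over the buffer */
    }
    return stages;
}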

 

Summary

 

In this second article we presented the implementation of the 3 main FFT mixed-radix computation blocks by means of OpenCL. Regarding the twiddle factor multiplications, we showed how simple changes in the CL kernels can significantly speed up the computation.

We also saw that, although the nature of the pipeline is sequential, each stage represents an embarrassingly parallel problem that can be easily and efficiently implemented with OpenCL.

Finally, with the comparison between mixed-radix and radix-2, we saw that mixed-radix not only provides valuable flexibility in the choice of N, but also delivers a considerable speed-up for power-of-2 sizes when radix-4 is available.

 

In the next and final article of this blog series we are going to learn how to use the mixed-radix FFT we have implemented as a building block for computing the FFT in 2 dimensions, which is widely used in image processing, computer vision and machine learning applications.

 

Ciao,

Gian Marco

 

 

Back to Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 1

Read Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 3

ARM Vulkan: Built for Modern Mobile Systems



Built for Modern Mobile Systems

 

ARM, along with other key industry members, has made a lot of progress since the Vulkan announcement by the Khronos Group back in early 2015, and we’re now in the final stage of development.

 

What is Vulkan?

Vulkan is a new, open standard, cross platform API designed from the ground up to enable applications to run more efficiently on modern CPU and GPU architectures spanning desktop, console and most importantly, mobile.

 

Who’s going to use Vulkan?

Every application on a mobile device uses graphics in some way, for UI if nothing else. This includes the operating system, with Google in particular having made very clear statements of Vulkan support for Android, suggesting it will very likely become the core graphics API for that platform. Vulkan has received an unprecedented amount of industry support from OS vendors, silicon vendors, game engine developers and software developers.

 

lab.jpg

Higher performance at lower power

On mobile platforms Vulkan is a game changer, offering lower overhead, more direct control over the GPU and lower CPU usage. The end-user benefit is far more than bigger and better game graphics. Vulkan is a highly efficient API enabling developers to optimize their applications for significantly lower energy consumption than was previously possible.

 

Lower CPU usage and minimum energy consumption
Multi-thread / multicore friendly

Traditional graphics APIs were not designed for multi-threaded use and required a lot of synchronization with the CPU to manage draw calls, resulting in high CPU overhead and the CPU becoming a bottleneck, especially on mobile devices. Vulkan was designed from the ground up to minimize driver overhead and enable CPU multi-threading, with greater control over memory management for high-efficiency graphics and compute performance.

  vulkan blog 2.png

Direct GPU control with lower driver overhead

Applications using traditional graphics APIs can easily max out CPU usage during complex drawing, driving up energy consumption and generating very high device heat. GPU drivers needed to do a lot of work on the CPU to manage multiple rendering and compute processes. Vulkan provides developers with much greater control over GPU processing, enabling them to more easily balance graphics and compute loads across multiple cores. This results in low per-core usage, lower overall power consumption, less device heat and considerably longer battery life.

  Vulkan blog 3.png

With great power comes great responsibility

Vulkan has been designed with modern mobile GPU rendering and compute technologies in mind. Application developers can now manage how things are ordered and executed on their own rather than leaving it up to a generic driver. This gives them more work to do but also provides more flexibility in designing applications with better performance and lower energy consumption than was possible using OpenGL ES.

 

What about ARM?

Quote from the Vulkan March 2015 Khronos announcement …

“Since helping found Khronos, ARM has strived to improve the efficiency of standards and deliver compelling graphics to mobile devices with minimum energy consumption,” said Jem Davies, vice president of technology, media processing group, ARM. “Vulkan is a big step forward in enabling our ecosystem of developers to unleash the capabilities of the latest ARM GPU technology.”

ARM, along with other key industry players, has been very busy refining the Vulkan API and ensuring it delivers on modern mobile needs. We will offer Vulkan support* on several GPUs shipping today and this support will be a key deliverable for many future Mali GPUs.

 

*Based on an internal draft Khronos specification of Vulkan, which may change prior to final release. Conformance criteria for this specification have not yet been established.


Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 3


Back to Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 1

Back to Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 2

 

In this third and last part of this blog series we are going to extend the mixed-radix FFT OpenCL™ implementation to two dimensions and explain how to optimise it to achieve good performance on Mobile ARM® Mali™ GPUs. After a short introduction explaining why the two-dimensional Fourier transform is an important and powerful tool, the first section covers the mathematical background that lets us reuse the existing one-dimensional approach to build the two-dimensional implementation. Afterwards, each step is described individually and the details are explained. Finally, at the end of each section we show the performance improvements achieved.

 

So far only the one-dimensional Fourier transform has been considered to explain the mathematical background and basic optimisations. However, in the field of image processing the two-dimensional variant is much more popular. For instance, a simple high-pass filter can be realised in the frequency domain after the image has been transformed by a two-dimensional Fourier transform. The resulting image could be used to implement an edge detection algorithm.

 

firefly.png  firefly_spec.png  firefly_spec2.png  firefly_edge.png
Original image (spatial domain) | Original image (frequency domain) | Removed high frequencies | Filtered image (spatial domain)
High-pass filter via Fourier transform

 

In general, the Fourier transform can be used to make the convolution operations used in many image processing applications more efficient. The improvement in the number of operations that have to be executed is significant, particularly when the kernel size is large. In the spatial domain, a two-dimensional convolution requires, for every element of the image, a number of multiplications and additions proportional to the size of the convolution kernel, so the resulting complexity is O(MNmn). In contrast, in the frequency domain the convolution corresponds to a pointwise multiplication of the transformed kernel and image, which reduces the complexity of the convolution itself to O(MN). Nevertheless, since the Fourier transform has to be computed in addition, it is not always beneficial to perform the convolution in the frequency domain.
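
To give a rough, illustrative sense of scale: for a 512 x 512 image (M = N = 512) and a 15 x 15 kernel (m = n = 15), spatial convolution needs in the order of 512 * 512 * 15 * 15 ≈ 59 million multiply-accumulate operations, whereas the pointwise multiplication in the frequency domain needs only 512 * 512 ≈ 0.26 million complex multiplications, plus the cost of the forward and inverse transforms.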

 

Background

Due to its separability the two-dimensional Fourier transformation can be expressed as two consecutive one-dimensional transformations. The mathematical principles are no different from those explained in the first part of this blog series which means that this background section will be rather short. It just shows briefly how the general form of the two-dimensional discrete Fourier transform can be split into its one-dimensional parts.

 

The two dimensional DFT can be expressed in the form

fft_derivation_1.png

where the trigonometric constant coefficients are defined as

fft_derivation_4.png

The coordinates k and l of the frequency domain are in the range

fft_derivation_2.png

 

Since each of the trigonometric constant coefficients depends only on one of the summation indices the inner sum can be extracted and defined as

fft_derivation_3.png

which is equal to the definition which was previously used to express the one-dimensional DFT. Only the parameter n had to be added to select the correct value in the two-dimensional input space.

 

If the newly defined one-dimensional transform is inserted into the original expression for the two-dimensional DFT it is easy to see from the following equation that just another one-dimensional transform is calculated

fft_derivation_5.png

Because of the commutativity of the operations either of the sums can be extracted. Therefore it does not matter over which dimension the first one-dimensional transform is calculated.
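
For completeness, the textbook form of the expressions referenced above can be written as follows (the figures in the original post are assumed to show the equivalent equations):

X(k, l) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x(m, n) \, e^{-j 2\pi \left( \frac{k m}{M} + \frac{l n}{N} \right)}, \qquad 0 \le k < M, \quad 0 \le l < N

\tilde{x}(k, n) = \sum_{m=0}^{M-1} x(m, n) \, e^{-j 2\pi \frac{k m}{M}} \qquad \Longrightarrow \qquad X(k, l) = \sum_{n=0}^{N-1} \tilde{x}(k, n) \, e^{-j 2\pi \frac{l n}{N}}

The inner transform is exactly the one-dimensional DFT used so far, computed once per value of n, and the outer sum is a second one-dimensional DFT along the other dimension.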

 

Implementations

In the background section the separability of the two-dimensional Fourier transform has been used to represent the two-dimensional transform as two consecutive one-dimensional transforms, one along each dimension. The row-column algorithm is based on this mathematical fact and is used for this implementation. In this algorithm a one-dimensional FFT is first computed individually for every row of the two-dimensional input data; afterwards a second one-dimensional FFT is performed for every column. However, instead of having two different functions to compute the FFT for the rows and for the columns, the same function is used and the input data is transposed between and after the two one-dimensional FFTs. This makes sense because the data layout is assumed to be row-major, so the most efficient way in terms of cache locality is to load data along the rows.

 

2d_fft_0.png  2d_fft_1.png  2d_fft_2.png  2d_fft_3.png  2d_fft_4.png
DFT2D_output_0.png  DFT2D_output_1.png  DFT2D_output_2.png  DFT2D_output_3.png  DFT2D_output_4.png
Original image | Separate 1D FFTs along the rows (first dimension) | Transpose intermediate image | Separate 1D FFTs along the rows (second dimension) | Transpose final image
Schematics and example of the row-column algorithm

 

Naive row-column algorithm

The naive implementation directly realises the principle of the row-column algorithm and the execution of the 2D FFT is therefore straightforward. The following diagram provides a first overview over the different stages of the 2D FFT and the pseudo code demonstrates the idea of a naive implementation.

 

overview.png

 

Pseudo code

// Run the 1D FFT along the rows of the input buffer
// The result is stored to a temporary buffer
fft_mixed_radix_2d_run_1d_part(input, tmp);


// Transpose the intermediate buffer in-place
transpose_complex_list(tmp);

// Run the 1D FFT along the rows of the (transposed) intermediate buffer
// Corresponds to the columns of the original buffer
// The result is stored to the output buffer
fft_mixed_radix_2d_run_1d_part(tmp, output);

// Transpose the output buffer in-place
transpose_complex_list(output);

 

The function to compute the 1D FFT is conceptually the same as the one used in the previous parts of this blog series, except that it has to loop over all rows of the source image. As a reminder: the chosen variant of the mixed-radix FFT requires the input values to be in digit-reversed order in order to compute the output values in linear order. Therefore the input values have to be shuffled before the actual FFT can be computed. The reordering is repeated for all rows of the matrix. Similarly, the computation of the 1D FFT is repeated for all rows independently of each other.

 

/**
* @brief Computes a 1D FFT for each row of the input
*
* @param[in]  input  The input matrix
* @param[out] output The output matrix
*/
fft_mixed_radix_2d_run_1d_part(input, output)
{
    // Iterate over all rows of the input matrix
    for(row_in in input and row_out in output)
    {
        // Perform digit-reverse reordering
        rev_row = digit_reverse(row_in);

        // Calculate mixed radix 1D FFT using the reversed row
        row_out = ...
    }
}

 

Naive implementation based on OpenCL mixed-radix kernels

This optimisation does not introduce anything terribly new compared to the optimisation of the one-dimensional Fourier transform. The first speedup over the CPU implementation can be achieved by running the computation of the actual FFT on the GPU. For this purpose the buffer with the input values has to be mapped to a cl_buffer. Afterwards the reordering to digit-reversed order is performed on the GPU. The rows of the matrix can be reordered independently, so a one-dimensional kernel is enqueued for each one. The reordering is followed by the first one-dimensional part of the FFT, which is now also computed on the GPU. Once the computation has finished, the buffers are mapped again to perform the matrix transpose between the two stages of the FFT on the CPU and out-of-place. Subsequently, the rows are reordered again and the second one-dimensional FFT is computed. Finally, the results of the second stage are transposed once more to restore the original data layout and copied over to the output buffer.

 

overview2.png

 

A loop is used to enqueue separate and independent one-dimensional kernels for each row of the input data. The kernels take the offset of the row within the linearised data representation as argument buf_offset_float.

Because each complex element of the buffer is represented through two floating point numbers (real and imaginary part) the row offset has to be doubled to get the actual offset within the buffer.

 

// Digit reverse stage
// Uses a 1D NDRange within the loop to process all rows of the buffer
const size_t digit_rev_gws[] = {N};

for(uint row = 0; row < N; ++row)
{
    uint buf_offset_float = row * N * 2;

    clSetKernelArg(digit_reverse_kernel, 0, sizeof(cl_mem), &input_buf);
    clSetKernelArg(digit_reverse_kernel, 1, sizeof(cl_mem), &digit_reversed_buf);
    clSetKernelArg(digit_reverse_kernel, 2, sizeof(cl_mem), &idx_bit_reverse_buff);
    clSetKernelArg(digit_reverse_kernel, 3, sizeof(cl_uint), &buf_offset_float);
    clEnqueueNDRangeKernel(queue, digit_reverse_kernel, 1, NULL, digit_rev_gws, NULL, 0, NULL, NULL);
}

// Perform 2D FFT
// Uses 1D NDRange within the loop to process all rows of the buffer
for(uint row = 0; row < N; ++row)
{
    // Offset of the current row within the buffer (two floats per complex value)
    uint buf_offset_float = row * N * 2;

    uint Nx = 1;
    uint Ny = radix_stage[0];

    // First stage
    const size_t first_stage_gws[] = {N / Ny};
    clSetKernelArg(first_stage_radix_kernel, 0, sizeof(cl_mem), &digit_reversed_buf);
    clSetKernelArg(first_stage_radix_kernel, 1, sizeof(cl_uint), &buf_offset_float);
    clEnqueueNDRangeKernel(queue, first_stage_radix_kernel, 1, NULL, first_stage_gws, NULL, 0, NULL, NULL);

    // Update Nx
    Nx *= Ny;

    // Following stages
    for(uint s = 1; s < n_stages; ++s)
    {
        // ...
    }
}

 

/**
 * @brief Computes the first stage of a radix-2 DFT.
 *
 * @param[in, out] input  The complex array.
 * @param[in]      offset The offset of the current row in the array.
 */
kernel void radix_2_first_stage(global float* input, const uint offset)
{
    // Each work-item computes a single radix-2
    uint idx = get_global_id(0) * 4;

    // Load two complex input values
    // The offset already accounts for the two floats per complex number
    float4 in = vload4(0, input + offset + idx);

    // Compute DFT N = 2
    DFT_2(in.s01, in.s23);

    // Store two complex output values
    vstore4(in, 0, input + offset + idx);
}

 

Replacing loops with two-dimensional kernels

In the first, naive GPU implementation separate one-dimensional kernels are enqueued for every row of the matrix. The performance can be significantly increased by using kernels with a two-dimensional NDRange instead of loops. The second dimension is then used as a replacement for the row offset within the linearised data structure which was previously incremented in the loop. In addition to the offset in the second dimension the offset in the buffer depends also on the size of the matrix which is passed as parameter N. Again the offset needs to be multiplied by two to account for the two part complex numbers.

 

overview3.png

 

First of all this change reduces the number of kernels and therefore the overhead of enqueuing and dispatching them. Moreover, it increases the number of work items which can be executed in parallel, since in principle work items from all rows can now run concurrently. Before, the kernels had to be executed one after the other and thus there was only a comparably small number of work items at any point in time, which reduced the GPU's utilisation, particularly for small matrices.

 

// Digit reverse stage
// Uses a 2D NDRange to process all rows of the buffer
const size_t digit_rev_gws[] = {N, N};
clSetKernelArg(digit_reverse_kernel, 0, sizeof(cl_mem), &input_buf);
clSetKernelArg(digit_reverse_kernel, 1, sizeof(cl_mem), &digit_reversed_buf);
clSetKernelArg(digit_reverse_kernel, 2, sizeof(cl_mem), &idx_bit_reverse_buf);
clSetKernelArg(digit_reverse_kernel, 3, sizeof(cl_uint), &N);
clEnqueueNDRangeKernel(queue, digit_reverse_kernel, 2, NULL, digit_rev_gws, NULL, 0, NULL, NULL);

uint buf_stride_float = N * 2;

// First stage
// Uses a 2D NDRange to process all rows of the buffer
uint Nx = 1;
uint Ny = radix_stage[0];

const size_t first_stage_gws[] = {N / Ny, N};
clSetKernelArg(first_stage_radix_kernel, 0, sizeof(cl_mem), &digit_reversed_buf);
clSetKernelArg(first_stage_radix_kernel, 1, sizeof(cl_uint), &buf_stride_float);
clEnqueueNDRangeKernel(queue, first_stage_radix_kernel, 2, NULL, first_stage_gws, NULL, 0, NULL, NULL);

// Update Nx
Nx *= Ny;

// Following stages
for(uint s = 1; s < n_stages; ++s)
{
    // ...
}

 

/**
 * @brief Computes the first stage of a radix-2 DFT.
 *
 * @param[in, out] input            The complex array.
 * @param[in]      buf_stride_float The number of floats per row in the array.
 */
kernel void radix_2_first_stage(global float* input, const uint buf_stride_float)
{
    // Each work-item computes a single radix-2
    uint idx = get_global_id(0) * 4;

    // Compute the offset into the buffer for the current row
    // Needs to be multiplied by two because complex numbers
    // are represented through two float values
    const uint offset = get_global_id(1) * buf_stride_float; // <-- Previously computed in the host code

    // Load two complex input values
    float4 in = vload4(0, input + offset + idx);

    // Compute DFT N = 2
    DFT_2(in.s01, in.s23);

    // Store two complex output values
    vstore4(in, 0, input + offset + idx);
}

 

Parallel transpose on the GPU

The next optimisation moves the transpose operation so that it also runs on the GPU. If it is not performed in-place there are no dependencies between the individual elements at all and the transpose can be fully executed in parallel. For in-place algorithms at least two elements have to be swapped in each step, as otherwise information would be overwritten.

 

overview4.png

 

Parallel elementwise out-of-place transpose

This approach resembles the CPU variant of a naive elementwise transpose of a matrix. Each work item reads one element according to its global id from the source buffer and writes it to the corresponding transposed position in the output buffer. The global work group size is thus equal to the size of the matrix (N x N). The elements themselves consist of two floating point values which represent the real and imaginary parts of the complex values.

 

/**
 * @brief Transposes a quadratic matrix of complex numbers stored in row major order.
 *
 * @param[in]  input              The complex input array.
 * @param[out] output             The complex output array.
 * @param[in]  buf_stride_complex The number of complex numbers per row in the array.
 */
kernel void transpose(global float2* input, global float2* output, const uint buf_stride_complex)
{
    uint ix = get_global_id(0);
    uint iy = get_global_id(1);

    output[ix * buf_stride_complex + iy] = input[iy * buf_stride_complex + ix];
}
naive.png

 

Parallel blockwise out-of-place transpose

However, this simple approach is not efficient enough to compensate for the additional driver overhead for launching kernels when performing the transpose on the GPU. To increase the performance the transpose operation has to be executed blockwise instead of elementwise. Instead of reading and writing one pixel per work item a small sub-block of the complete matrix is loaded. It is then locally transposed and written back to the corresponding position in the output matrix.

 

Compared to the naive transpose approach this optimisation increases cache efficiency and resource utilisation, since each work item operates on more than one pixel. Further, it helps to increase the utilisation of the GPU by making the kernels larger and slightly more complex. As a result the number of work items decreases, which reduces the overhead of dispatching work items for execution. The choice of two-by-two sub-blocks turned out to be optimal, since the local transpose of larger sub-blocks requires more registers.

 

/**
 * @brief Transposes a quadratic matrix of complex numbers stored in row major order.
 *
 * @param[in]  input            The complex input array.
 * @param[out] output           The complex output array.
 * @param[in]  buf_stride_float The number of floats per row in the array.
 */
kernel void transpose2x2(global float* input, global float* output, const uint buf_stride_float)
{
    const uint ix = get_global_id(0);
    const uint iy = get_global_id(1);

    float4 tmp;
    float4 u0, u1;

    // Load one sub-block from two rows
    u0 = vload4(ix, input + (iy * 2) * buf_stride_float);
    u1 = vload4(ix, input + (iy * 2 + 1) * buf_stride_float);

    // Transpose the sub-block
    tmp = u0;
    u0 = (float4)(tmp.s01, u1.s01);
    u1 = (float4)(tmp.s23, u1.s23);

    // Write the sub-block to the transposed position
    vstore4(u0, iy, output + (ix * 2) * buf_stride_float);
    vstore4(u1, iy, output + (ix * 2 + 1) * buf_stride_float);
}
blockwise.png

 

Speedup

image001.png

 

Parallel blockwise in-place transpose

Due to its nature a transpose operation can be realised by always swapping two corresponding elements, or two corresponding sub-blocks respectively. This allows us to load, transpose and store two sub-blocks per work item in order to reduce the number of work items and to achieve a better balance between arithmetic and load/store operations. As a side effect the transpose algorithm can be changed from out-of-place to in-place, such that the second buffer is no longer needed.

 

As a consequence of the transpose-and-swap approach it would be sufficient to have one work item for each block in the upper triangular matrix. However, if a one-dimensional kernel is enqueued with the smallest possible number of work items the transformation of the linear index into the two-dimensional index is less efficient than simply using a two-dimensional kernel and aborting the work items below the diagonal. The diagonal itself requires special treatment because the diagonal sub-blocks do not need to be exchanged with a different sub-block but only have to be transposed locally.

 

/**
 * @brief Transposes a quadratic matrix of complex numbers stored in row major order.
 *
 * @param[in, out] input            The complex array.
 * @param[in]      buf_stride_float The number of floats per row in the array.
 */
kernel void transpose2x2(global float* input, const uint buf_stride_float)
{
    const uint ix = get_global_id(0);
    const uint iy = get_global_id(1);

    // Abort for sub-blocks below the diagonal
    if(ix < iy)
    {
        return;
    }

    float4 tmp;
    float4 v0, v1, u0, u1;

    // Load first sub-block
    u0 = vload4(ix, input + (iy * 2) * buf_stride_float);
    u1 = vload4(ix, input + (iy * 2 + 1) * buf_stride_float);

    // Transpose first sub-block
    tmp = u0;
    u0 = (float4)(tmp.s01, u1.s01);
    u1 = (float4)(tmp.s23, u1.s23);

    // Only process a second sub-block if the first one is not on the diagonal
    if(ix != iy)
    {
        // Load second sub-block
        v0 = vload4(iy, input + (ix * 2) * buf_stride_float);
        v1 = vload4(iy, input + (ix * 2 + 1) * buf_stride_float);

        // Transpose second sub-block
        tmp = v0;
        v0 = (float4)(tmp.s01, v1.s01);
        v1 = (float4)(tmp.s23, v1.s23);

        // Store second sub-block to transposed position
        vstore4(v0, ix, input + (iy * 2) * buf_stride_float);
        vstore4(v1, ix, input + (iy * 2 + 1) * buf_stride_float);
    }

    // Store first sub-block to transposed position
    vstore4(u0, iy, input + (ix * 2) * buf_stride_float);
    vstore4(u1, iy, input + (ix * 2 + 1) * buf_stride_float);
}
transpose.png

 

Speedup

image005.png

 

Setting local work group size

For the two-dimensional transpose kernel the automatically inferred values for the local work group size were not able to achieve good performance. It turned out to be better to set the local work group size manually to 2 x 4, or smaller if the buffer size (divided by two, because each work item handles a 2 x 2 sub-block) is not a multiple of 4. The best size for the local work groups always depends on the kernel, and finding a good size can be hard. However, getting the sizes right can help to achieve better caching behaviour and therefore improve runtime performance.

 

// Setting up constants
// For odd matrix sizes padding is necessary
// because the transpose kernel stores two complex values
const size_t padding = 2 * (N % 2);
const size_t buf_stride_float = N * 2 + padding;
const size_t buf_stride_half = buf_stride_float / 2;

clSetKernelArg(transpose_kernel, 0, sizeof(cl_mem), &fft_buf);
clSetKernelArg(transpose_kernel, 1, sizeof(cl_uint), &buf_stride_float);

// Set global work group size
const size_t transpose_gws[2] = {buf_stride_half, buf_stride_half};
size_t transpose_lws[2];
size_t *transpose_lws_ptr;

// Determine the largest possible local work group size
// It has to divide the global work group size
if(buf_stride_half % 4 == 0)
{
    transpose_lws[0] = 2;
    transpose_lws[1] = 4;
    transpose_lws_ptr = transpose_lws;
}
else if(buf_stride_half % 2 == 0)
{
    transpose_lws[0] = 1;
    transpose_lws[1] = 2;
    transpose_lws_ptr = transpose_lws;
}
else
{
    transpose_lws_ptr = NULL;
}

clEnqueueNDRangeKernel(queue, transpose_kernel, 2, NULL, transpose_gws, transpose_lws_ptr, 0, NULL, NULL);

 

Speedup

image007.png

 

Comparison to radix-2 implementation

The following diagram visualises the benefit, in terms of shorter runtime, of using the mixed-radix variant instead of a radix-2 only implementation. Note the logarithmic scaling of the y-axis. Another advantage of a mixed-radix approach over a radix-2 only FFT is the larger number of supported sizes. For instance, FFTs for matrix sizes 120, 1000 and 3000 cannot be computed with the radix-2 only approach and are therefore missing from the diagram.

image003.png

 

Conclusion

While this pushes the row-column algorithm to its limits, it might be possible to achieve better performance if the underlying FFT algorithm is changed. For instance, if the FFT is not computed in-place, the digit-reversal could be dropped and, moreover, the transpose could be merged into the mixed-radix FFT kernels. However, it is difficult to predict the benefits, if there are any.

Useful optimisation aspects

The following list summarises the things that have been considered to increase the performance of the 2D FFT mixed-radix algorithm on Mobile ARM® Mali™ GPUs in this article:

  • Instead of enqueuing many one-dimensional kernels in a loop it is better to use one/fewer higher-dimensional kernel/s.
  • The optimal size of the sub-blocks for the blocked transpose depends on the available resources. Increasing the block size has only positive effects as long as the cache is big enough and sufficient registers are available.
  • Padding buffers can help to generalise kernels and can prevent branches in the control flow. For instance, the mapped buffers are padded to the next larger even size in both dimensions to eliminate edge cases during the transpose operation.
  • Loads and stores are most efficient if they are performed according to the data layout, because only then is the cache used efficiently.
  • It can be more efficient to conditionally abort superfluous work items in a 2D range directly instead of computing the two-dimensional index out of a linear index.
  • Find ways to parallelise sequential code and keep work items as independent from each other as possible.
  • Compute more than one pixel (or more bytes) per work item.
  • Reduce the number of load/store operations by vectorising. This is especially beneficial because the GPU comprises a vector unit (SIMD).

 

Further reading

 

Back to Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 1

Back to Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 2

Porting a Graphics Engine to the Vulkan API


The new Vulkan API from Khronos is the first to be designed to work across all platform types. Vulkan is a very low-level API that gives developers explicit control of *everything*.

 

In this blog we want to share with developers our experience of porting a graphics engine to the Vulkan API. The engine we decided to port to Vulkan already had support for various other standard APIs such as OpenGL 4.5, OpenGL ES 2.0 and OpenGL ES 3.2. Although these APIs are all different, they have some characteristics in common, for example that an application cannot efficiently talk to the driver from multiple threads. Implementing multi-threading on top of those previous APIs would not improve performance, as it would require a lot of context switches which would eat up most of the performance budget. The GL APIs were designed for previous generations of GPU hardware, and while the capabilities of hardware and technology have evolved, the APIs haven’t done so at the same rate. Vulkan therefore comes at just the right time to address most of the issues of the existing APIs. It also provides developers with a lot more control and flexibility over resources, and over how and when “jobs” are executed in the graphics pipeline. It does, however, mean that there isn’t the level of support developers might have come to expect when working with OpenGL ES, but there are Vulkan debug layers available to plug in that can sanity-check your usage and keep you on the right track.
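
As a minimal illustration of how those debug layers can be enabled (a sketch only, not our engine code; the layer name is the one shipped with the LunarG SDK at the time of writing and may differ between SDK versions), validation is simply requested when the Vulkan instance is created:

#include <vulkan/vulkan.h>

static VkInstance create_debug_instance(void)
{
    /* Request the standard validation layer for development builds */
    const char* layers[] = { "VK_LAYER_LUNARG_standard_validation" };

    VkApplicationInfo app_info = { 0 };
    app_info.sType      = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    app_info.apiVersion = VK_MAKE_VERSION(1, 0, 0);

    VkInstanceCreateInfo info = { 0 };
    info.sType               = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    info.pApplicationInfo    = &app_info;
    info.enabledLayerCount   = 1;
    info.ppEnabledLayerNames = layers;

    VkInstance instance = VK_NULL_HANDLE;
    if (vkCreateInstance(&info, NULL, &instance) != VK_SUCCESS)
    {
        /* In a real application: fall back to no layers or report the error */
        return VK_NULL_HANDLE;
    }
    return instance;
}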

 

Another really important advantage of Vulkan is memory management. You might have wondered how buffers and textures are handled by the GPU pipeline and at what points they are actually allocated and destroyed. If you assumed they are allocated and deallocated at the moment you call glTexImage2D, glBufferData or glDeleteTexture, you might be surprised. The behavior of memory management depends on the vendor, the GPU and the pipeline architecture. The desktop pipeline, which is typically immediate, is different from mobile (deferred). In OpenGL the memory management is hidden from the programmer and the driver decides when is the best time to allocate memory for a resource. While an immediate architecture may be simpler when it comes to handling memory, a deferred one is more complicated, as it requires retaining buffers across multiple frames.

 

 

If you want to learn more about deferred pipeline architecture please see my colleague’s blog:

https://community.arm.com/groups/arm-mali-graphics/blog/2014/02/03/the-mali-gpu-an-abstract-machine-part-1

 

For example, if you have a dynamic buffer that you allocate and deallocate within one frame, the driver would have to keep a reference to this data until it’s no longer used in the pipeline. For an immediate mode architecture this might be 1-2 frames, while a deferred mode architecture might need to keep track of this for 2-4 frames. Handling textures may require keeping a reference for even more frames, since the processing sits deeper in the pipeline. This requires maintaining more memory and complicates tracking the data. In the Vulkan API, all memory management is more explicit. The developer allocates and deallocates memory which needs to be bound to Vulkan objects in order to be useful. This also opens up the possibility to alias memory between, for instance, a texture and a buffer.

It’s also up to the developer to ensure that memory that is used by the pipeline is valid throughout execution – this means that you cannot free memory that is bound to the pipeline. Luckily Vulkan also provides a means to check when a given command buffer has been completed and decide whether to free or recycle memory.
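
A minimal sketch of this explicit model (not our engine's actual code) is shown below: a buffer is created, device memory is allocated for it by the application, the two are bound together and the memory is mapped for CPU access. Here 'device', 'physical_device' and 'find_memory_type' are assumed helpers/handles that a real application would already have:

VkBuffer       buffer = VK_NULL_HANDLE;
VkDeviceMemory memory = VK_NULL_HANDLE;

VkBufferCreateInfo buffer_info = { 0 };
buffer_info.sType       = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
buffer_info.size        = 64 * 1024;                      /* 64 KiB, arbitrary */
buffer_info.usage       = VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT;
buffer_info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;
vkCreateBuffer(device, &buffer_info, NULL, &buffer);

/* The driver no longer allocates for us: query what the buffer needs... */
VkMemoryRequirements reqs;
vkGetBufferMemoryRequirements(device, buffer, &reqs);

/* ...allocate a block of device memory ourselves... */
VkMemoryAllocateInfo alloc_info = { 0 };
alloc_info.sType           = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
alloc_info.allocationSize  = reqs.size;
alloc_info.memoryTypeIndex = find_memory_type(physical_device, reqs.memoryTypeBits,
                                              VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT); /* hypothetical helper */
vkAllocateMemory(device, &alloc_info, NULL, &memory);

/* ...and bind it to the buffer object. The same memory block could be
 * sub-allocated and bound to several resources (aliasing). */
vkBindBufferMemory(device, buffer, memory, 0);

/* CPU access is explicit too */
void* data = NULL;
vkMapMemory(device, memory, 0, VK_WHOLE_SIZE, 0, &data);
/* ...write data, then vkUnmapMemory(device, memory); */

/* The memory must stay alive until the GPU has finished using it; only then
 * is it safe to call vkFreeMemory(device, memory, NULL). */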

 

First attempt

Okay, let’s get to the nitty gritty of the porting experience and how we approached it. As mentioned earlier regarding multithreading issues, our engine was single threaded. Therefore, in the first approach we designed the Vulkan backend on the same model as our existing OpenGL backend. Effectively we just implemented another API without changing any of the existing structure of the engine. In order to achieve that we created a default command buffer to which we internally recorded commands, but still all on one thread. We also had to create several objects such as Pipelines and Pipeline layouts which were reused for subsequent similar drawcalls. On the other hand, we called vkQueueWaitIdle after every vkQueueSubmit, which caused stalls and compromised performance. That, however, wasn’t our priority, as the main goal was simply to get the engine backend up and running on Vulkan.

 

In the engine we had a custom shader language that we can translate to different standards: HLSL, GLSL and GLSL ES. Basically just yet another shader cross compiler. The Vulkan API comes with the new SPIR-V shader intermediate representation, so that was another significant chunk of work to be done. First we had to cross compile using our custom compiler and then compile this down to SPIR-V and feed it to the Vulkan driver. What we found interesting in compiling this down to SPIR-V was being able to extract binding information directly from the SPIR-V binary blob in order to create compatible pipeline layouts.

 

In GL, a framebuffer doesn’t have a clear beginning and end. The only ways you can tell with GL are SwapBuffers, clearing/discarding, or switching framebuffer objects. The Vulkan API addresses this by having the developer explicitly declare when drawing begins and ends. Our initial implementation mimicked the GL behavior, so that when we either called SwapBuffers or changed framebuffer, we ended the render pass and started a new one. This was obviously a temporary solution and definitely not what we would call best practice, but it helped us get the engine working to the same level as OpenGL with any content, except maybe for performance!

 

Next steps

Having the engine working on the Vulkan API as described above doesn’t give you better performance or the benefit of any of the cool features of Vulkan. In order to improve upon the process described above, we took the following steps to make the engine work more efficiently and to properly utilize Vulkan features.

 

We made the following key decisions:

  • Sacrifice some OpenGL performance to get the best out of Vulkan
  • Implement block based memory allocator
  • Implement new command buffer abstraction
  • Implement new explicit render pass interface
  • Standardize on SPIR-V for all our shaders

 

Sacrifice some OpenGL performance to get the best out of Vulkan

The first step was redesigning the engine backend and, more specifically, the structures and functionality responsible for interfacing with the driver. Unfortunately, once we ported to Vulkan, content running on OpenGL performed slightly worse than previously. This is mainly due to the descriptor set emulation that we implemented in the OpenGL backend: for our OpenGL backend we added support for allowing multiple sets in the shaders. The SPIR-V cross compiler removes the sets in the OpenGL GLSL output and replaces them with a serialized list of binding locations together with a remapping table. The descriptor set emulation in OpenGL then has to rebind and remap textures, samplers, uniform buffers, etc. depending on the remapping table. This means that we get a slight performance hit if we have lots of shaders using wildly different binding tables, because of the API overhead of rebinding.

 

But this is a decision that you will have to make for your graphics engine. We wanted to support descriptor sets at a high level, but this can also be hidden at a lower level in your engine so that you don’t have to think about handling descriptor set remapping in your OpenGL implementation.

 

Block based memory allocator

We added a block based memory allocator where we allowed two types of allocations: “static” and “transient.” Static memory is allocated for either the scene or the engine lifetime. The transient allocation type, on the other hand, is memory which lasts one frame at most. Transient memory is allocated and tracked by our memory manager. We decided to track this memory for the same number of frames as we created in our swap chain. This means that we only need to track entire blocks of memory, which simplifies a lot of things.

We also implemented constant data memory updates using the block based memory allocator but for this we made ring frame allocators that can be pinned to specific threads in order to allow efficient uniform data updates across multiple threads.

Device memory is memory that can be bound to Vulkan objects. You may be familiar with VBOs, PBOs, UBOs and so on, implemented in the OpenGL API. Those were designed to map/unmap memory on the CPU. Once the memory has been mapped, the CPU can read and write the data allocated by the GPU. In Vulkan this is more generic: a memory allocation can back anything, be it a VBO, a PBO or a texture, and it can also be a framebuffer, for instance.
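
To illustrate the idea of the transient/ring frame allocator described above, here is a hypothetical sketch (not our engine's actual code): transient allocations are sub-allocated from one large, persistently mapped device memory block, and the whole block is recycled once the frame that used it has completed.

typedef struct
{
    VkDeviceMemory memory;   /* one large device memory block, persistently mapped */
    uint8_t*       mapped;   /* CPU pointer returned by vkMapMemory                */
    VkDeviceSize   size;     /* total size of the block                            */
    VkDeviceSize   head;     /* current allocation offset within the frame         */
} frame_allocator;

/* Allocate 'bytes' of transient memory, valid for the current frame only.
 * 'align' is assumed to be a power of two. */
static VkDeviceSize frame_alloc(frame_allocator* fa, VkDeviceSize bytes, VkDeviceSize align)
{
    VkDeviceSize offset = (fa->head + align - 1) & ~(align - 1);
    if (offset + bytes > fa->size)
    {
        return (VkDeviceSize)-1;   /* out of transient memory for this frame */
    }
    fa->head = offset + bytes;
    return offset;                 /* caller writes to fa->mapped + offset   */
}

/* Called once the fence of the frame that used this allocator has signalled */
static void frame_reset(frame_allocator* fa)
{
    fa->head = 0;
}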

 

Command buffer abstraction

As Vulkan is capable of recording to command buffers on multiple threads, we made a decision to make explicit abstractions for command buffers. We implemented this as a thin wrapper which pushes calls to the Vulkan command buffer object.

The command buffer abstraction also keeps track of state locally per command buffer, and the wrapper flushes the commands when required. By flushing we do not mean flushing the graphics pipeline as was the case in OpenGL; what we mean is creating any necessary Vulkan state objects. In other words, we group all “draw calls” with similar state, and as long as the draw calls share the same state we put them into one batch. Once a “draw call” arrives with a different state we flush the current state, which means we create new Vulkan objects. Note that we only create new pipeline objects if we don’t already have an equivalent one in the cache.

There are a lot of Vulkan objects that are created on-demand, such as:

  • VkPipeline
  • VkPipelineLayout
  • VkFramebuffer
  • VkRenderPass

 

Some of these (like VkPipeline) need to be created during command buffer recording, which means creation needs to be thread-safe and efficient. In order to keep it running efficiently we reduced lock contention by spreading the caches over a number of threads. This ensured that we could create state objects at runtime without hitting performance bottlenecks.

VkFramebuffer and VkRenderPass objects need to be created depending on the render target to which you’re rendering, so there are dependencies between different Vulkan objects that you need to track and handle correctly. The ultimate solution would be to create all state upfront, as that would mean you could just focus the graphics engine on feeding the GPU with existing data.
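
The kind of on-demand caching described above could look roughly like the following hypothetical sketch (illustrative only, not our engine's code): the complete pipeline state is hashed and used as a key, so an equivalent VkPipeline is only ever created once.

typedef struct
{
    uint64_t   key;        /* hash of the complete pipeline state */
    VkPipeline pipeline;   /* lazily created Vulkan object        */
} pipeline_cache_entry;

static VkPipeline get_or_create_pipeline(pipeline_cache_entry* cache, size_t cache_size,
                                         VkDevice device,
                                         const VkGraphicsPipelineCreateInfo* info,
                                         uint64_t state_hash)
{
    /* Look for an equivalent pipeline first */
    for (size_t i = 0; i < cache_size; ++i)
    {
        if (cache[i].pipeline != VK_NULL_HANDLE && cache[i].key == state_hash)
        {
            return cache[i].pipeline;
        }
    }

    /* Not found: create it on demand. In the real engine this path runs during
     * command buffer recording and therefore has to be thread-safe. */
    VkPipeline pipeline = VK_NULL_HANDLE;
    vkCreateGraphicsPipelines(device, VK_NULL_HANDLE, 1, info, NULL, &pipeline);

    /* Insert into the first free slot (eviction policy omitted) */
    for (size_t i = 0; i < cache_size; ++i)
    {
        if (cache[i].pipeline == VK_NULL_HANDLE)
        {
            cache[i].key      = state_hash;
            cache[i].pipeline = pipeline;
            break;
        }
    }
    return pipeline;
}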

 

Explicit render pass interface

We added an explicit render pass interface. The render pass is one of the most interesting features of the Vulkan API as it basically defines where the rendered content is going to end up. You also explicitly declare what to load and what to store before and after a render pass. It also has more powerful capabilities such as rendering sub-passes. Sub-passes are basically a more vendor-neutral version of Pixel Local Storage: they allow the shader to read attachments written by the previous sub-pass at the same pixel location. You may read more about Pixel Local Storage here: https://community.arm.com/groups/arm-mali-graphics/blog/2014/04/01/pixel-local-storage-on-arm-mali-gpus
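
A minimal sketch of these explicit load/store declarations for a single-subpass render pass with one colour attachment is shown below ('device' and the attachment format are assumptions for illustration):

VkAttachmentDescription colour = { 0 };
colour.format         = VK_FORMAT_R8G8B8A8_UNORM;
colour.samples        = VK_SAMPLE_COUNT_1_BIT;
colour.loadOp         = VK_ATTACHMENT_LOAD_OP_CLEAR;      /* declared: what happens before the pass */
colour.storeOp        = VK_ATTACHMENT_STORE_OP_STORE;     /* declared: what happens after the pass  */
colour.stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
colour.stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
colour.initialLayout  = VK_IMAGE_LAYOUT_UNDEFINED;
colour.finalLayout    = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR;

VkAttachmentReference colour_ref = { 0 };
colour_ref.attachment = 0;
colour_ref.layout     = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;

VkSubpassDescription subpass = { 0 };
subpass.pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS;
subpass.colorAttachmentCount = 1;
subpass.pColorAttachments    = &colour_ref;
/* Further subpasses could read this attachment via pInputAttachments,
 * which is what makes subpasses resemble Pixel Local Storage. */

VkRenderPassCreateInfo rp_info = { 0 };
rp_info.sType           = VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO;
rp_info.attachmentCount = 1;
rp_info.pAttachments    = &colour;
rp_info.subpassCount    = 1;
rp_info.pSubpasses      = &subpass;

VkRenderPass render_pass = VK_NULL_HANDLE;
vkCreateRenderPass(device, &rp_info, NULL, &render_pass);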

 

Standardize on SPIR-V for all our shaders

We found it beneficial to write a library for this, as we can then parse the SPIR-V and retrieve information about input and output bindings. We can then reuse this information for pipeline layout creation. It is also very useful for re-using already existing pipeline layouts which are compatible across different shader buckets.

 

Conclusion

As a result of porting the engine to Vulkan, we have created the Metropolis demonstration which showcases Vulkan’s benefits. The demo is at a very early stage of development. As we implement more features for GDC and then SIGGRAPH, you will see the obvious benefits of the API as we issue 2-3k draw calls per frame while simultaneously producing dynamic cascaded shadows.

There are more features of Vulkan that we will cover in a future blog post:

  • Vulkan as a power saving API
  • Push constants
  • Compute
  • Descriptor set handling
  • Pipeline barriers
  • Handling image transitions

 

 

This demo includes assets which have been created under licence from TurboSquid Inc.

Mali Graphics Debugger V3.4 Released with Vulkan Support


This is a very exciting time for graphics. Today marked the release of Vulkan, the latest graphics API to be developed by the Khronos Group. It has been completely redesigned from the ground up with a much greater emphasis on low-level efficiency. Some of the key features include better multi-threading support, reduced CPU workload and, for the first time, desktop and mobile using the same graphics API. This last feature will make generating content for both mobile and desktop even easier. More information about Vulkan can be found in the community blog ARM Vulkan: Built for Modern Mobile Systems and on the Khronos website.

 

As it is a completely new API we appreciate there could be a learning curve required. Help is at hand though as the Mali Graphics Debugger now has Vulkan support. This means that you will be able to trace your Vulkan applications and see what function calls are being used in your application and in what order. It will also detect end-of-frame events so that you can see what function calls were involved in making each frame.

 

simpleTriangle.png  vulkanMgdUi.png

 

Keep checking the ARM Connected Community as over the coming few months the MGD Engineers will be working hard to implement even more features to make your transition from OpenGL ES to Vulkan easier. We'd also love to hear your thoughts below on how to make Vulkan support in MGD be as good as it can possibly be.

Omate's Mali-based Smartwatch for Kids Gives Peace of Mind Parenting


Last week, at MWC, ARM® partner Omate® announced their first standalone smartwatch designed especially for kids, with their brand-new Omate Wherecom K3. Featuring a 3G Dual-core 1GHz ARM Cortex®-A7 and an ARM Mali™-400 GPU, the K3 achieves the low power and area requirements essential for wearables whilst also delivering all the key features.

omate1.jpg

Designed for six to nine year olds, the K3 is a standalone smartwatch with 3G and Wi-Fi capabilities, a capacitive touchscreen with a child-friendly Wherecom user interface running on top of Android 5.1. In collaboration with security and location specialists Monitorlinq, the K3 incorporates GPS technology with communications features such as 3G and Wi-Fi hotspot usage to triangulate location for greater accuracy. This allows parents better-than-ever location services as well as instant communication options for the ultimate in peace of mind whenever the kids are away from home.

omate2.jpg

Laurent Le Pen, CEO of Omate, explained the reasoning behind Omate’s choice of ARM power for their devices:

 

ARM is an established name in processor IP and it made sense for us to complement the ARM Cortex A-7 with the Mali-400 GPU for the ease of integration between the products as well as unparalleled power saving features. Achieving a high quality user experience in the smallest area and with the best possible battery life was a key consideration for us and the Mali-400 helps us deliver that.

 

Mali has recently been recognised as the highest shipping GPU in the world and the Mali-400 is still ARM’s most licensed model. The combination of optimum power savings and silicon area efficiency of the Mali-400 GPU make it the ideal choice to suit the low power, low cost requirements of the wearables market.

 

The Mali family of GPUs is the most flexible and scalable on the market, ranging from the highly efficient Mali-400 up to the Mali-T880, the high power, high efficiency option for high end smartphones, tablets and DTVs. It’s not all about the GPU however, the combined Mali Multimedia Suite (MMS) of GPUs, video processors and display processors is pre-optimized to work better together and reduce time to market. The latest generation of ultra-low power Mali GPUs is well placed to continue the rise of high end wearables so stay tuned for the appearance of the Mali-470 in devices soon.

 

Launched off the back of a million-dollar-plus Kickstarter campaign back in 2013, Omate now has a whole range of smart wearables. From heavy duty sports watches like the recently released Omate Rise, to fashion-conscious design styles like the feminine Lutetia, and now this addition of kids’ products, Omate really does have something for everyone. They even pioneered companion jewellery pieces, in collaboration with French fashion house Ungaro, which allow you to receive instant vibration alerts when your ‘favourite’ contact gets in touch. With such a fast growing range, and with the Omate Rise funded in just fifteen minutes on Indiegogo, this is a company to watch and one that ARM looks forward to working with in the future.

Virtual Reality – Blatant Latency and how to Avoid it


Welcome back to the third post in my Virtual Reality (VR) blog series. We’ve already discussed the reasoning behind our interest in VR as a major driving force for graphics development, but even in the time between posts, VR uptake has grown tangibly and it’s become apparent that the industry is moving ahead fast. It’s only a matter of time now until VR is totally mass market. Yes, it will probably start with high end gaming but is set to grow so quickly that the time to take notice is now.

 

The what and the why

In blog two we considered the difficulty of focus when developing for VR and some of the ways of managing this. This time around we’re looking at how to develop low latency VR. The latency measurement in question is the time it takes to turn the actual motion of your head into the image you see on the screen of your VR headset. The two events need to be close enough together that you don’t notice any disconnect – just like in the real world. If latency is too high or too variable the immersion feels unnatural and the disparity with your brain’s understanding of normal movement will start to cause nausea or dizziness – not a super fun experience. Industry research tells us that this “motion-to-photons” latency should be consistently less than 20 milliseconds (ms) for a smooth and natural VR experience. This is a tough call at a standard refresh rate of 60Hz, where a single refresh already takes around 16.7ms, but it is attainable with the right approach.

 

The winning combination

There are a few elements which, when combined, can contribute greatly to successfully building a low latency VR system. The first aspect we will consider is front buffer rendering. Double or triple buffering is commonly used in graphics applications, including Android, in order to increase smoothness by allowing the GPU to draw pixels to a rotating set of off-screen buffers, swapping them with the on-screen buffer at the end of each display refresh. This process helps iron out variations between neighbouring frame times but also has the added effect of increasing latency, which is of course the opposite of what we are looking for in a VR application. In front buffer rendering, the GPU is able to bypass the off-screen buffers and render directly to the buffer the display reads from in order to reduce latency. Rendering to the front buffer needs careful synchronisation with the display to ensure the GPU is always writing ahead of the display reading. The context_priority extension available on Mali GPUs is used to enable prompt scheduling of tasks on the GPU, allowing front buffer rendering processes such as Timewarp to take priority over less immediately urgent tasks and improve the user experience.

[Figure: front buffer rendering compared with double buffered rendering]

Another important part of this puzzle is selecting the right type of display for your VR device. Organic Light Emitting Diode (OLED) displays can further improve a VR experience, and they work differently from the familiar and well-established LCD. Each pixel in an OLED display provides its own light source, driven by the Thin Film Transistor array that sits behind it, as opposed to the white LED backlight of an LCD. The brightness of an OLED display is set by the current driven through the organic film. Because colours are managed by individually varying the tiny red, green and blue subpixels, it is possible to get brighter, sharper hues with stronger saturation. Sections of the panel can be turned off entirely, achieving a deeper, truer black than is typically possible on an LCD, which has to block out its backlight. This is the usual selling point for OLED panels, but critically for VR it also offers an easier means of achieving low persistence through partial illumination. A full persistence display is lit continuously, which means the scene view is correct only briefly and then very quickly out of date. A low persistence approach keeps the image lit only while the view is correct and then goes dark, a process which is imperceptible at a high refresh rate and provides the illusion of a continuous image.

[Figure: full persistence versus low persistence screens]

This is important for reducing blur for the user. The additional flexibility with which you can illuminate the pixels in an OLED panel means that the display can show multiple partial images during a single refresh, reacting mid-frame to the changes fed to it by the sensors in the headset and allowing the head location to change while the scan is moving across the screen. This is not possible with an LCD panel without replacing the global backlight. The ability to drive an OLED panel by drawing to the front buffer in sections or slices, through a Timewarp-like process, is key to achieving a lower latency VR experience, because the image you see on screen can adapt to your head movements much more quickly than is otherwise possible.
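Purely as an illustration of the slicing idea, the sketch below paces four per-slice updates across one refresh period. The slice count, the time-based pacing and the draw_slice() helper are assumptions made for this example rather than a real Mali or Android interface; a production system would synchronise against the display’s vsync and fence signals rather than wall-clock arithmetic.

```c
/* Illustrative sketch of slice-based ("beam racing") front buffer updates.
   Timing scheme and helpers are hypothetical, for explanation only. */
#include <stdint.h>
#include <time.h>

#define NUM_SLICES 4
#define REFRESH_HZ 60.0
#define FRAME_NS   ((int64_t)(1e9 / REFRESH_HZ))
#define SLICE_NS   (FRAME_NS / NUM_SLICES)

static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
}

/* Hypothetical per-slice renderer: re-sample the latest head pose, warp the
   corresponding horizontal band and write it straight to the front buffer.
   The actual GPU work is elided in this sketch. */
static void draw_slice(int slice_index)
{
    (void)slice_index;
}

/* Draw each slice just before the scan-out reaches it, so the GPU always
   writes ahead of the display read. */
static void race_the_beam(int64_t vsync_ns)
{
    for (int s = 0; s < NUM_SLICES; ++s) {
        int64_t draw_start = vsync_ns + (int64_t)(s - 1) * SLICE_NS;
        while (now_ns() < draw_start) {
            /* In a real system we would sleep or wait on a fence here. */
        }
        draw_slice(s);
    }
}

int main(void)
{
    /* Pretend the last vsync happened right now and race one frame. */
    race_the_beam(now_ns());
    return 0;
}
```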

[Figure: low persistence display updated slice by slice within a refresh]

Why warp?

Now we come to one of the keystones of this combination: the Timewarp process. Because the scene in an immersive VR application changes comparatively gradually, the image changes between views by a small and therefore relatively predictable amount. Warping is essentially shifting an image rendered at an older head location to match a newer one. This partially decouples the application frame rate from the refresh rate and allows the system to enforce a latency guarantee that some applications may not provide on their own. The shifting can account for changes in head rotation but not for head position or scene animation, so it is something of an approximation, but it provides an effective safety net and enables an application running at 30 FPS to appear, at least in part, as if it is tracking the user’s head at 60 FPS or above.
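As a rough illustration of the rotational correction involved, the sketch below computes the delta rotation between the head orientation a frame was rendered with and the latest orientation reported by the sensors, assuming both are tracked as unit quaternions. A real Timewarp pass would convert this delta into a matrix or homography and apply it while resampling the rendered image, which is beyond the scope of this snippet.

```c
/* Sketch of the rotation-only part of Timewarp using unit quaternions. */
#include <stdio.h>

typedef struct { float w, x, y, z; } quat;

/* Inverse of a unit quaternion. */
static quat quat_conjugate(quat q)
{
    quat r = { q.w, -q.x, -q.y, -q.z };
    return r;
}

/* Hamilton product a * b. */
static quat quat_mul(quat a, quat b)
{
    quat r = {
        a.w*b.w - a.x*b.x - a.y*b.y - a.z*b.z,
        a.w*b.x + a.x*b.w + a.y*b.z - a.z*b.y,
        a.w*b.y - a.x*b.z + a.y*b.w + a.z*b.x,
        a.w*b.z + a.x*b.y - a.y*b.x + a.z*b.w
    };
    return r;
}

/* Rotation that takes the view rendered at 'render_pose' to 'latest_pose'. */
quat timewarp_delta(quat render_pose, quat latest_pose)
{
    return quat_mul(latest_pose, quat_conjugate(render_pose));
}

int main(void)
{
    quat rendered = { 1.0f,    0.0f, 0.0f,    0.0f }; /* looking straight ahead */
    quat latest   = { 0.9962f, 0.0f, 0.0872f, 0.0f }; /* ~10 degrees of yaw     */
    quat delta    = timewarp_delta(rendered, latest);
    printf("warp delta: w=%.4f x=%.4f y=%.4f z=%.4f\n",
           delta.w, delta.x, delta.y, delta.z);
    return 0;
}
```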

[Figure: warping an image rendered at an older head pose to a newer one]

Tricks of the trade

In this post we've discussed the tight integration that needs to exist between GPU and display, but this is only one part of the stack. Once we want to play videos, perhaps DRM-protected ones, and integrate system notifications, the complexity grows further. High quality VR support requires that your multimedia components are well synchronised and communicate in bandwidth-efficient ways, in order to provide not only the best experience to the end user but also the best power efficiency and performance. The ARM® Mali™ Multimedia Suite (MMS) of GPU, video and display processors is integrated and supported by technologies such as ARM Frame Buffer Compression (AFBC) and ARM TrustZone®, making it a leader in VR development technology.

 

Join us at GDC to find out more!
