ARM Mali Graphics

ARM enables smartphone-quality visuals on wearable and IoT devices with the Mali-470 graphics processor


Yesterday, the Media Processing Group at ARM announced a new highly-efficient graphics processing unit (GPU), the ARM® Mali™-470, to enable smartphone-quality visuals on wearable and IoT devices:

 

ARM Mali-470 GPU Offers Improved Efficiency and Experiences on Wearable and IoT Devices

 

A growing market with unique challenges

 

The wearables market has been growing steadily for many years with more and more devices and applications entering the market. ARM has long been associated with wearables, with many devices based on ARM technologies and more recently with the “Wearables for good” challenge in partnership with UNICEF and frog. That association now extends to graphics processing with the Mali-470.

 

Mali-470 is the latest in the Mali-400 series of graphics processors that run applications using the ubiquitous OpenGL® ES 2.0 graphics standard. The Mali-400 family has shipped in more than a billion devices worldwide and is favoured where efficient graphics processing is a must. One example is the growing number of Systems-on-Chip (SoCs) designed specifically for wearable and IoT applications, such as MediaTek’s MT2601 SoC, announced earlier this year in support of Google’s Android Wear software:

 

MediaTek Introduces MT2601 in Support of Google’s Android Wear Software

 

The key advantage of Mali-470 is that it consumes half the power of the Mali-400 GPU, helping device manufacturers bring the smartphone user experience to environments with even greater power constraints.

 

Expanding the smartphone user experience



 

We’ve all become accustomed to high-quality visuals, backed by a touchscreen, as the most intuitive way of interacting with our smartphones and tablets. When we use other types of device, we want to interact with them in a very similar way.

 

For those of us who remember Video Cassette Recorders, and the frustration of trying to program the timer for the first time, it’s hard to imagine anyone tolerating that kind of user experience ever again. Yet, across many devices, the user interface quality has fallen far behind that of our smartphone.

 

From watches to thermostats, industrial control panels in factories and warehouses, multi-function printers in offices, infotainment systems in cars and home appliances, highly efficient graphics processing is essential to render intuitive user interfaces.

 

The challenge many of these devices face is power consumption and how to reduce it as the interface becomes more sophisticated – we think Mali-470 is the answer.

 

Why OpenGL ES 2.0?

 

Every pixel matters in delivering high-quality user interfaces. This is especially true for smaller screens, where every pixel must play a role in conveying information clearly, providing intuitive controls, or both.

 

The majority of Android™, Android Wear and other emerging operating systems, such as Tizen™, use OpenGL ES 2.0 for modern user interfaces, mapping, casual gaming, etc. OpenGL ES 2.0 offers the ideal balance between per-pixel control with programmable shaders and energy-efficiency. Mali-470 uses the same industry-standard OpenGL ES 2.0 driver stack as the Mali-400 GPU so there is no need to re-optimise existing applications – anything written for Mali-400 will work seamlessly on the Mali-470 GPU.

 

More recent versions of OpenGL ES have introduced a number of additional features to support immersive video games; however, the OpenGL ES 2.0 feature level is the most efficient for the user interfaces that appear on wearable and IoT devices.

 


Half the power consumption

 

Building on the success of the Mali-400 GPU, Mali-470 delivers the same rich performance at the same process geometry while halving the power consumption. This provides SoC manufacturers with scalable options to enable them to create embedded graphics subsystems that meet the needs of new low-power devices.

 


 

Mali-470 achieves this by building on the energy-efficiency gained in Mali-450 and applying focussed design changes to the Vertex and Fragment Processors. This results in half the power consumption with the same performance when compared to the Mali-400. Vertex processors construct the “wire frame” of a scene and the fragment processors perform the per-pixel shading, colours and effects such as transparency. For wearable device resolutions a single fragment processor is sufficient, but Mali-470 has the ability to scale to four fragment processors to support the higher resolutions of devices with larger screens.

 

Mali-470 block diagram: up to 4 pixel processors can be implemented, and this multi-core design supports screen resolutions from 640x640 to 1080p at 60FPS, 32bpp

 

The design improvements in Mali-470 can be grouped into three areas of equal importance: Quad-thread scheduling, Microarchitectural and Datapath optimisations.

 

Quad-thread scheduling optimisations:

  • Enforcing the grouping of quads (2x2 pixel threads) so that the frequency of control and state updates within the pipelines is significantly reduced.
  • Optimising many of the functional blocks to operate on quads.
  • Centralising a subset of per-quad state and accessing it only when necessary, rather than clocking it through the pipelines.

Microarchitectural optimisations:

  • Making aggressive use of clock-gating throughout the design, including clock-gating of all function-orientated L1 caches
  • Bypassing functional blocks whenever instruction execution can proceed without them

Datapath optimisations:

  • Optimising datapaths to make targeted use of fixed-point arithmetic, rather than floating-point arithmetic for vertex processing

 

Wearables and beyond…

 

Designed for wearables and IoT devices, the Mali-470 GPU will benefit a multitude of devices that require a rich UI and where energy-efficiency is important, especially when coupled with ARM CPUs such as the Cortex®-A7 and A53 processors. You can see some of the possibilities below:

 


 

To summarise, the Mali-470 graphics processor further expands the smartphone experience into a wider range of devices including wearables, home gateways and appliances, industrial control panels, healthcare monitors and even new entry-level smartphones.

 

With half the power consumption of the billion-selling Mali-400 GPU, Mali-470 opens the door for more vibrant user interfaces and provides exciting opportunities for designers to innovate with graphics in even more power-constrained environments. We expect to see Mali-470 appearing in its first devices from early 2017.


Improving ARM® Mali™ drivers on fbdev


This blog post refers to the public ARM Mali Midgard r6p0 user-space binary drivers for GNU/Linux, which are now available for download. The "fbdev" variant now has support for dma-buf using a dedicated ioctl, which is explained in detail here.

 

What is wrong with fbdev?

 

The Linux kernel defines a user-space API that lets applications control displays via frame buffer drivers, also known as "fbdev" drivers.  This is achieved via a set of file operations on "/dev/fb*" devices such as ioctl or mmap, which allows direct access to the pixel data of a display.  While this is a simple, widespread, lightweight and powerful interface, it is no longer suitable in the world of modern embedded graphics computing.

 

For example, it does not provide any means to share the frame buffer directly with another kernel driver such as a GPU driver.  When running an application that renders graphics with a GPU, each frame contains pixels that the GPU generated within its own buffers.  The display controller is a separate entity with its own frame buffer located elsewhere in memory.  So the standard way of displaying the GPU frames in fbdev mode is for the user-space (EGL part of the driver) to copy the data from a GPU buffer to the display controller's frame buffer.  This is done on the CPU, typically by calling memcpy.  This works, but can easily become the biggest bottleneck in the system and is therefore not acceptable in a real product.
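To make the copy concrete, here is a minimal sketch (in C, with error handling omitted and the function name purely illustrative) of what this conventional fbdev present path amounts to: query the frame buffer layout, map it, and memcpy each finished GPU frame into it on the CPU.

    /* Minimal sketch of the conventional fbdev present path: the CPU copies
     * each finished GPU frame into the display controller's frame buffer.
     * Error handling is omitted for brevity; present_frame_with_memcpy() is
     * an illustrative name, not part of any driver API. */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <linux/fb.h>

    static void present_frame_with_memcpy(const void *gpu_pixels, size_t frame_bytes)
    {
        int fd = open("/dev/fb0", O_RDWR);

        struct fb_fix_screeninfo fix;
        ioctl(fd, FBIOGET_FSCREENINFO, &fix);   /* line length, buffer size */

        /* Map the display controller's frame buffer into user space ... */
        void *fb = mmap(NULL, fix.smem_len, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);

        /* ... and copy the GPU output into it on the CPU. This copy is the
         * bottleneck the rest of this post sets out to remove. */
        memcpy(fb, gpu_pixels, frame_bytes);

        munmap(fb, fix.smem_len);
        close(fd);
    }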

 

There are many other limitations to "fbdev" which are not covered in this blog post, such as managing layers, synchronising applications with the display controller or performing 2D composition in hardware.

 

What can we do to fix it?

 

All new display controller drivers should implement the Direct Rendering Manager (DRM) API instead of fbdev.  DRM offers a much more modern set of features and solves many of the problems seen with fbdev.  However, there are still a lot of GPU-accelerated products that control their display using an fbdev Linux driver.  Work needs to be done on each of these to avoid the unacceptably inefficient CPU memcpy and to let the GPU write directly into the frame buffer.

 

A typical way of achieving this is by using the dma-buf framework in the Linux kernel.  Typically, the frame buffer driver registers itself as a dma-buf exporter and implements a way for user-space to get a file descriptor for this frame buffer (an ioctl).  The user-space driver then passes the dma-buf file descriptor to the GPU kernel driver to import and use.  When rendering happens, the GPU pixels are directly written into the frame buffer using a hardware DMA operation - this is also referred to as "zero-copy" and is much faster than calling memcpy.  However, there is no standard ioctl in the fbdev API to export the frame buffer with dma-buf so each fbdev driver has a slightly different one.  This means the user-space needs to be modified to work with each fbdev driver, which is not compatible with the public standard Mali binary drivers as they must not depend on platform specific display drivers.

 

FBIOGET_DMABUF

 

This is why, starting with r6p0, we are adding a more generic ioctl to export the dma-buf file descriptor, defined as a custom extension to the standard fbdev API in supported kernels: the FBIOGET_DMABUF ioctl.  This way, no extra dependency is added to the user-space binary.  If the ioctl is not available on a given kernel, then the user-space will revert to memcpy.  We have already enabled this in our ODROID-XU3 Linux kernel branch on GitHub.
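As an illustration, user space can probe for the new ioctl and fall back to the copy path when it is absent. The structure layout and ioctl number below mirror the custom extension as found in kernels such as the ODROID-XU3 branch; they are shown here as an assumption and should be confirmed against the headers of the kernel you are actually targeting.

    /* Hedged sketch: try FBIOGET_DMABUF, fall back to memcpy if unsupported.
     * The struct layout and ioctl number are assumed to match the custom
     * fbdev extension in the supported kernels; check your kernel headers. */
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/fb.h>

    #ifndef FBIOGET_DMABUF
    struct fb_dmabuf_export {
        uint32_t fd;     /* returned dma-buf file descriptor */
        uint32_t flags;  /* currently unused */
    };
    #define FBIOGET_DMABUF _IOR('F', 0x21, struct fb_dmabuf_export)
    #endif

    /* Returns a dma-buf fd for the frame buffer, or -1 if the kernel does not
     * implement the extension and the caller must revert to the memcpy path. */
    static int get_framebuffer_dmabuf(int fb_fd)
    {
        struct fb_dmabuf_export req = { 0, 0 };

        if (ioctl(fb_fd, FBIOGET_DMABUF, &req) < 0)
            return -1;      /* e.g. ENOTTY: ioctl not supported */

        return (int)req.fd; /* pass this to the GPU kernel driver to import */
    }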

 

We should keep this added functionality in all our new fbdev Mali Midgard binary drivers and Linux kernels for supported platforms, such as ODROID-XU3, Chromebook devices with an ARM Mali GPU, Firefly...

Eats, Shoots and Interleaves


Vertex interleaving

 

Recently we were asked via the community whether there was an advantage in either interleaving or not interleaving vertex attributes in buffers. For the uninitiated, vertex interleaving is a way of mixing all the vertex attributes into a single buffer. So if you had 3 attributes (let’s call them Position (vec4), Normal (vec4), and TextureCoord (vec2)) uploaded separately they would look like this:

 

P1xP1yP1zP1w , P2xP2yP2zP2w , P3xP3yP3zP3w ... and so on

N1xN1yN1zN1w , N2xN2yN2zN2w , N3xN3yN3zN3w ... and so on

T1xT1y , T2xT2y , T3xT3y ... and so on

 

(In this case the commas denote a single vertex worth of data)

The interleaved buffer would look like this:

 

P1xP1yP1zP1w N1xN1yN1zN1w T1xT1y , P2xP2yP2zP2w N2xN2yN2zN2w T2xT2y ,

P3xP3yP3zP3w N3xN3yN3zN3w T3xT3y ... and so on

 

(Note the colours for clarity)

 

… Such that the individual attributes are mixed, with a given block containing all the information for a single vertex. This technique is what the stride argument in the glVertexAttribPointer function and its variants is for, allowing the application to tell the hardware how many bytes it has to jump forwards to get to the same element in the next vertex.
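As a quick illustration of how the two layouts are described to the driver, the OpenGL ES 2.0 calls below set up the separate-buffer case and the interleaved case; the attribute locations and buffer objects are assumed to have been created elsewhere.

    /* Sketch of describing both layouts with glVertexAttribPointer.
     * POS_LOC, NRM_LOC and TEX_LOC are assumed attribute locations. */
    #include <GLES2/gl2.h>

    /* Separate, tightly packed buffers: a stride of 0 means "no gap". */
    static void setup_separate(GLuint pos_vbo, GLuint nrm_vbo, GLuint tex_vbo,
                               GLint POS_LOC, GLint NRM_LOC, GLint TEX_LOC)
    {
        glBindBuffer(GL_ARRAY_BUFFER, pos_vbo);
        glVertexAttribPointer(POS_LOC, 4, GL_FLOAT, GL_FALSE, 0, (void *)0);
        glBindBuffer(GL_ARRAY_BUFFER, nrm_vbo);
        glVertexAttribPointer(NRM_LOC, 4, GL_FLOAT, GL_FALSE, 0, (void *)0);
        glBindBuffer(GL_ARRAY_BUFFER, tex_vbo);
        glVertexAttribPointer(TEX_LOC, 2, GL_FLOAT, GL_FALSE, 0, (void *)0);
    }

    /* One interleaved buffer: each vertex is {vec4 P, vec4 N, vec2 T},
     * i.e. 10 floats, so the stride is 10 * sizeof(GLfloat) = 40 bytes. */
    static void setup_interleaved(GLuint vbo,
                                  GLint POS_LOC, GLint NRM_LOC, GLint TEX_LOC)
    {
        const GLsizei stride = 10 * sizeof(GLfloat);
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glVertexAttribPointer(POS_LOC, 4, GL_FLOAT, GL_FALSE, stride, (void *)0);
        glVertexAttribPointer(NRM_LOC, 4, GL_FLOAT, GL_FALSE, stride,
                              (void *)(4 * sizeof(GLfloat)));
        glVertexAttribPointer(TEX_LOC, 2, GL_FLOAT, GL_FALSE, stride,
                              (void *)(8 * sizeof(GLfloat)));
    }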

 

However, even though we all knew about interleaving, none of us could really say whether it was any better or worse than just putting each attribute in a different buffer, because (to put it bluntly) separate buffers are just easier to implement.

 

So in a twist to usual proceedings I have conferred with arguably the top expert in efficiency on Mali,  Peter Harris. What follows is my interpretation of the arcane runes he laid out before my quivering neurons:

 

Interleaving is better for cache efficiency…

 

… Sometimes.

 

Why does interleaving work at all?

 

The general idea behind interleaving is related to cache efficiency. Whenever data is pulled from main memory it is loaded as part of a cache line. This single segment of memory will almost certainly contain more than just the information desired, as one cache line is larger than any single data type in a shader program. Once in the local cache the data in the loaded line is more quickly available for subsequent memory reads. If this cache line only contains one piece of required information, then the next data you need is in a different cache line which will have to be brought in from main memory. If however, the next piece of data needed is in the same cache line, the code can fetch directly from the cache and so performs fewer loads from main memory and therefore executes faster.

 

Without getting into physical memory sizes and individual components, this can be illustrated thusly:

 

Imagine we have 3 attributes, each of them vec4s. Individually they look like this:

 

| P1 P2 | P3 P4 | ...

| N1 N2 | N3 N4 | ...

| T1 T2 | T3 T4 | ...

 

From this point forward those vertical lines represent the boundaries between cache lines. For the sake of argument, the cache lines in this example are 8 elements long, so contain 2 vec4s; but in the real world our cache lines are 64 bytes, large enough to hold four 32-bit precision vec4 attributes. For the sake of clear illustration I’ll be keeping the data small in these examples, so if we want all the data for vertex number 2 we would load 3 cache lines from the non-interleaved data:

 

P1 P2

N1 N2

T1 T2

 

If this data is interleaved like so:

 

| P1 N1 | T1 P2 | N2 T2 | P3 N3 | T3 P4 | N4 T4 | ...

 

The cache lines fetched from main memory will contain:

 

T1 P2

N2 T2

 

(We start from T1 because of the cache line alignment)

 

Using interleaving we've performed one less cache line fetch. In terms of wasted bandwidth, the non-interleaved case loaded 3 attributes which went unused, but only one unused attribute was fetched in the interleaved case. Additionally, it's quite possible that the T1 P2 cache line wouldn't need to be specifically fetched while processing vertex 2 at all; if the previously processed vertex was vertex 1, it is likely that the data will still be in the cache when we process vertex 2.

 

Beware misalignment

 

Cache efficiency can be reduced if the variables cross cache line boundaries. Notice that in this very simple example I said the texture coordinates were vec4s. Ordinarily texture coordinates would be held in vec2 format, as shown in the very first explanation of interleaving. In this case, visualising the individual elements of the buffer, the cache boundaries would divide the data in a very nasty way:

 

PxPyPzPw NxNyNzNw | TxTy PxPyPzPw NxNy | NzNw TxTy PxPyPzPw | …

 

Notice that our second vertex's normal is split, with the x,y and z,w in different cache lines. Though two cache lines will still contain all the required data, it should be avoided as there is a tiny additional power overhead in reconstructing the attribute from two cache lines. If possible it is recommended to avoid splitting a single vector over two cache lines (spanning a 64-byte cache boundary), which can usually be achieved by suitable arrangement of attributes in the packed buffer. In some cases adding padding data may help alignment, but padding itself creates some inefficiencies as it introduces redundant data into the cache which isn’t actually useful. If in doubt try it and measure the impact.
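As a sketch of what this means in practice, the two C structs below describe the packed 40-byte vertex (where, as in the example above, the second vertex's normal straddles a 64-byte line) and a version padded to 48 bytes so that every vector sits at a 16-byte-aligned offset and can never cross a cache line. The type names are illustrative only.

    /* Illustrative only: a packed 40-byte vertex versus one padded to 48
     * bytes. With a 48-byte stride every vec4 starts on a 16-byte boundary,
     * so no vector ever straddles a 64-byte cache line; the cost is 8 bytes
     * of redundant data per vertex, so measure before committing to it. */
    #include <assert.h>

    typedef struct {
        float position[4];   /* bytes  0-15 within the vertex */
        float normal[4];     /* bytes 16-31 */
        float tex_coord[2];  /* bytes 32-39 */
    } PackedVertex;          /* 40-byte stride: the second vertex's normal
                                spans bytes 56-71 and crosses a cache line */

    typedef struct {
        float position[4];
        float normal[4];
        float tex_coord[2];
        float padding[2];    /* dead data, but keeps every offset 16-aligned */
    } PaddedVertex;          /* 48-byte stride: no vector is ever split */

    static_assert(sizeof(PaddedVertex) == 48, "stride should be 48 bytes");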

 

But it's not always this simple

 

If we look at the function of the GPU naïvely, all of the above makes sense; however, the GPU is a little cleverer than that. Not all attributes need to be loaded by the vertex processor. The average vertex shader looks something like this:

 

     uniform vec4 lightSource;
     uniform mat4 modelTransform;
     uniform mat4 cameraTransform;

     in vec4 position;
     in vec4 normal;
     in vec2 textureCoord;
     in vec2 lightMapCoord;

     out float diffuse;
     out vec2 texCo;
     out vec2 liCo;

     void main( void ){
          texCo = textureCoord;
          liCo = lightMapCoord;
          diffuse = dot((modelTransform*normal),lightSource);
          gl_Position=cameraTransform*(modelTransform*position);
     }

 

If you look at our outputs, diffuse is calculated at this stage, as is gl_Position, but texCo and liCo are just read from the input and passed straight back out without any computation performed. For a deferred rendering architecture this is really a waste of bandwidth as it doesn’t add any value to the data being touched. In Midgard family GPUs (Mali-T600 or higher) the driver understands this (very common) use case and has a special pathway for it. Rather than load these attributes in the GPU and output them again to be interpolated, the vertex processor never really sees attributes of this type. They can bypass the vertex shader completely and are just passed directly to the fragment shader for interpolation.

 

Here I've used a second set of texture coordinates to make the cache align nicely for this example. If we fully interleave all of the attributes, our cache structure looks like this:

 

PxPyPzPw NxNyNzNw | TxTy LxLy PxPyPzPw | NxNyNzNw TxTy LxLy | ...

 

Here the vertex processor still needs to load in two attributes P and N, for which the cache line loads will either look like:

 

PxPyPzPw NxNyNzNw

 

… or …

 

TxTy LxLy PxPyPzPw | NxNyNzNw TxTy LxLy

 

… to obtain the required data, depending on which vertex we are loading. In this latter case the T and L components are never used, and will be loaded again separately to feed into the interpolator during fragment shading. It’s best to avoid the redundant data bandwidth of the T and L loads for the vertex shading and the redundant loads of P and N when fragment shading. To do this we can interleave the data into separate buffers, one which contains all of the attributes needed for computation in the vertex shader:

 

PxPyPzPw NxNyNzNw | PxPyPzPw NxNyNzNw | PxPyPzPw NxNyNzNw | ...

 

… and one containing all of the attributes which are just passed directly to interpolation:

 

TxTy LxLy TxTy LxLy | TxTy LxLy TxTy LxLy | TxTy LxLy TxTy LxLy | ...

 

This means that the vertex shader will only ever need to touch the red and green cache lines, and the fragment interpolator will only ever have to touch the blue and orange ones (as well as any other interpolated outputs from the vertex shader). This gives us a much more efficient bandwidth profile for the geometry processing. In this particular case it also means perfect cache alignment for our vertex processor.
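A hedged sketch of how this split might be described to OpenGL ES 2.0 follows; the attribute locations match the example shader above but are otherwise assumptions, as are the buffer object names.

    /* Sketch of the two-buffer split: one interleaved buffer for attributes
     * the vertex shader computes with (position, normal) and one for the
     * attributes that are only passed through to interpolation. */
    #include <GLES2/gl2.h>

    static void setup_split_buffers(GLuint compute_vbo, GLuint passthrough_vbo,
                                    GLint POS_LOC, GLint NRM_LOC,
                                    GLint TEX_LOC, GLint LMAP_LOC)
    {
        /* Buffer 1: P and N interleaved, 32 bytes per vertex. */
        glBindBuffer(GL_ARRAY_BUFFER, compute_vbo);
        glVertexAttribPointer(POS_LOC, 4, GL_FLOAT, GL_FALSE,
                              8 * sizeof(GLfloat), (void *)0);
        glVertexAttribPointer(NRM_LOC, 4, GL_FLOAT, GL_FALSE,
                              8 * sizeof(GLfloat), (void *)(4 * sizeof(GLfloat)));

        /* Buffer 2: T and L interleaved, 16 bytes per vertex. The vertex
         * shading pass never needs to touch these cache lines. */
        glBindBuffer(GL_ARRAY_BUFFER, passthrough_vbo);
        glVertexAttribPointer(TEX_LOC, 2, GL_FLOAT, GL_FALSE,
                              4 * sizeof(GLfloat), (void *)0);
        glVertexAttribPointer(LMAP_LOC, 2, GL_FLOAT, GL_FALSE,
                              4 * sizeof(GLfloat), (void *)(2 * sizeof(GLfloat)));
    }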

 

A note on data locality

 

Caches function best when programs make use of the data in the same cache lines in a small time window. This maximizes the chance that data we have fetched is still in the cache and avoids a refetch from main memory. Cache lines often contain data from multiple vertices which may come from multiple triangles. It is therefore best practice to make sure that these adjacent vertices in memory are also nearby in the 3D model (both in terms of attribute buffers and index buffers). This is called data locality and you normally need look no further than your draw call's indices (if you are not using indexed models you have far bigger problems than cache efficiency to solve). If the indices look like this:

 

(1, 2, 3) (2, 3, 4) (3, 4, 5) (4, 5, 2) (1, 3, 5) ...

 

You have good data locality. On the other hand, if they look like this:

 

(1, 45, 183) (97, 12, 56) (4, 342, 71) (18, 85, 22) ...

 

… then they're all over the place and you’ll be making your GPU caches work overtime. Most modelling software will have some kind of plugin to better condition vertex ordering, so talk to your technical artists to get that sorted somewhere in the asset production process.

 

To maximize the cache efficiency it’s also worth reviewing the efficiency of your vertex shader variable types, both in terms of sizes and number of elements. We see a surprising amount of content which declares vector elements and then leaves many channels unused (but allocated in memory and so using valuable cache space); or which uploads highp fp32 data and then uses it in the shader as a mediump fp16 value. Removing unused vector elements and converting to narrower data types (provided the OES_vertex_half_float extension is available) is a simple and effective way to maximize cache efficiency, reduce bandwidth, and improve geometry processing performance.
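For instance, once the extension is confirmed at runtime, the texture coordinate attribute can be declared as 16-bit half floats rather than 32-bit floats; the snippet below is a sketch with an assumed attribute location and a fallback path.

    /* Sketch: use 16-bit half-float texture coordinates when
     * GL_OES_vertex_half_float is available, halving their cache footprint. */
    #include <string.h>
    #include <GLES2/gl2.h>
    #include <GLES2/gl2ext.h>

    static void set_texcoord_pointer(GLint TEX_LOC, GLsizei stride_bytes,
                                     const void *offset)
    {
        const char *ext = (const char *)glGetString(GL_EXTENSIONS);

        if (ext && strstr(ext, "GL_OES_vertex_half_float"))
            glVertexAttribPointer(TEX_LOC, 2, GL_HALF_FLOAT_OES, GL_FALSE,
                                  stride_bytes, offset);
        else  /* fall back to fp32 data if the extension is missing */
            glVertexAttribPointer(TEX_LOC, 2, GL_FLOAT, GL_FALSE,
                                  stride_bytes, offset);
    }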

 

So there you have it, interleaving vertex attributes. It would be remiss of me to tell you to expect immediate vast performance improvements from this technique. At best this will only claw back a tiny bit of efficiency, but in large complex projects where you need to squeeze as much as possible out of the hardware, these tiny improvements can all add up.

 

Thanks again to Peter Harris, who provided a lot of the information for this blog and also was kind enough to go through it afterwards and take out all my mistakes.

The "ARM Game Developer Day" is back by popular demand!



 

Be part of the future of mobile game technology. Bring along your project challenges for expert advice on how to maximise performance of your game for mobile platforms. Learn about the latest ARM CPU and Mali GPU architecture, multicore programming and tile-based mobile GPUs and see how to implement highly optimized rendering effects for mobile.

 

The day will feature:

  • VR showcase area with live demos
  • Q&A Clinics and the sharing of best coding practice for mobile platforms
  • Talks on mobile VR and Geomerics’ advanced applications of dynamic global illumination, plus a panel discussion on “the future of mobile game technology”
  • Open discussions with ARM experts about your own project challenges and suggested improvements
  • The opportunity to network with top mobile game developers and engineers
  • Free hot lunch and refreshments throughout the day

 

London - ARM Game Developer Day

 

Thursday 3rd December 2015

9:00am - 6:00pm

Rich Mix, 35-47 Bethnal Green Rd, London E1 6LA

 

See full agenda and register here to make sure you don’t miss out!

 

 

Chengdu - ARM Game Developer Day

 

Wednesday 16th December 2015

9:00am - 6:00pm

iTOWN Coffee, Building A, Chengdu Tianfu Software Park, China

 

See full agenda and register here too!

Virtual Reality: The Ice Cave


This blog was written by Kapileshwar Syamasundar during his summer placement at ARM in the ARM Mali Graphics demo team. Kapil did some great work at ARM porting the Ice Cave demo to VR using Unity, we hope you can benefit from this too.

 

 

Ice Cave VR

 

1. Introduction

Ice Cave, the latest demo from the ARM Mali Ecosystem, has been shown with great success this year at major events such as GDC, Unite Europe and Unite Boston. The demo has been developed in Unity and aims to demonstrate that it is possible to render high visual quality content on current mobile devices. A number of highly optimized special effects were developed in-house, specifically for this demo, some of which are based on completely new techniques, for example the rendering of shadows and refractions based on local cubemaps.


Figure 1. View of the cave from the entrance in the Ice Cave demo.

 

The Ice Cave demo was released at a time when Virtual Reality has become the centre of attention in the game development community, and related events and media. A number of VR demos and games have already been released but VR performance requirements can limit the complexity of VR content and therefore the visual quality of the final VR experience.

It is in this landscape that the Ecosystem demo team decided to port the Ice Cave demo to Samsung Gear VR and this task was assigned to me. In this blog I describe my experience in porting the Ice Cave demo to VR during my eight weeks summer placement in the Ecosystem demo team.

By the time I joined the demo team, Unity had just released a version with VR native support for Oculus Rift and Samsung Gear VR.  Previously, VR support was only available by means of a plugin based on Oculus Mobile SDK, but this had some obvious limitations:

  • Each VR device has a different plugin
  • Plugins may conflict with each other
  • Release of newer VR SDKs / Runtimes can break older games
  • Lower-level engine optimizations are not possible with the plugin approach of two separate cameras

Conversely, the newly released Unity VR native integration lacked both support and sufficient information for developers, and experienced many unresolved issues. Nonetheless, the team was convinced that with the native integration in Unity we would be able to achieve the best possible performance; a key point in guaranteeing a successful VR user experience.

 

2. Samsung Gear VR

The Samsung Gear VR headset does not have a built-in display but has instead been designed to host a mobile phone. At the time of writing, the Samsung Gear VR comes in two versions: one for the Samsung Note 4 and another for the latest Samsung Galaxy S6. Some of the main specifications of the Samsung Galaxy S6 version are listed below.

 

  • Sensors: Accelerometer, Gyroscope, Geomagnetic, Proximity
  • Motion-to-photon latency < 20ms
  • Manual focal adjustment
  • Main physical UI: touch pad
  • Oculus’s Asynchronous TimeWarp technology

Figure 2. The Samsung Gear VR for Samsung Galaxy S6.

Samsung Gear VR is powered by Oculus VR software and incorporates the Oculus Asynchronous TimeWarp technology. This important feature helps reduce latency (the time taken to update the display based on the latest head movement), a key issue to avoid in VR devices. Besides the TimeWarp technology, the Samsung Gear VR has several sensors which it uses in place of the ones incorporated in the phone.

The Samsung Gear VR has its own hardware and features a touch pad, back button, volume key and, according to the specifications, an internal fan designed to help demist the device while in use.

The key point here however, is that you can insert your Samsung Galaxy S6 into the headset and enjoy an immersive experience with just a smartphone. We are no longer limited to the screen size of the phone and can instead become completely immersed in a virtual world.

 

3. Main steps to port an app/game to VR in Unity

VR integration in Unity has been achieved following one of the main Unity principles, that it must be simple and easy. The following basic steps are all that are needed to port a game to VR:

  • Use Unity 5.1 (or any higher version) with native VR support.
  • Obtain the signature file for your device from the Oculus website and place it in the Plugins/Android/assets folder.
  • Set the “Virtual Reality Supported” option in Player Settings.
  • Set a parent to the camera. Any camera control must set the position and orientation on the camera parent.
  • Associate the camera control with the Gear VR headset touch pad.
  • Build your application, deploy it on the device and launch it.
  • You will be prompted to insert the device into the headset. If the device is not ready for VR you will be prompted to connect to the network so the device can download the Samsung VR software.
  • NB. It is useful to set the phone to developer mode to visualize the application running in stereo without inserting it into the Gear VR device. You can enable developer mode only if you have previously installed an appropriately signed VR application.

 

Enabling Gear VR developer mode

  • Go to your device Settings - Application Manager - Gear VR Service
  • Select "Manage storage"
  • Tap on the "VR Service Version" six times
  • Wait for the scan process to complete and you should now see the Developer Mode toggle

Developer mode allows you to launch the application without the headset and also dock the headset at any time without having Home launch.

Figure 3. Steps to enable VR Developer mode on Samsung Galaxy S6.

 

 


Figure 4. Side-by-side view of stereo viewport captures with VR developer mode enabled.

 

 

4. Not as simple as it seems. Considering VR specifics.

After following the instructions above, I saw nothing but a black screen when inserting the device into the headset. It took me some time to get the VR application running and to establish that some existing features had to be changed and others added.

VR is a completely different user experience and this is therefore one of the key issues when porting to VR. The original demo had an animation mode which moved the camera through different parts of the cave to show the main features and effects. However, in VR this animation caused motion sickness to the majority of users, particularly when moving backwards. We therefore decided to remove this mode completely.

We also decided to remove the original UI. In the original Ice Cave demo a tap on the screen triggers a menu with different options but this was unsuitable for VR.  The original navigation system, based on two virtual joysticks, was also unsuitable for VR so we decided to entirely replace it with a very simple user interaction based on the touch pad:

  • Pressing and holding the touch pad moves the camera in the direction the user looks.
  • When you release the pressure the camera stops moving.
  • A double tap resets the camera to the initial position.

This simple navigation system was deemed to be intuitive and easy by all users trying the VR version of the demo.


Figure 5. User interaction with touch pad  on the Samsung Gear VR.

 

The camera speed was also a feature we considered carefully as many users experienced motion sickness when the camera moved just a little too fast. After some tests we were able to set a value that most people were comfortable with.

Additionally, the camera has to be set as a child of a game object. This is the only way Unity can automatically integrate the head tracking with the camera orientation. If the camera has no parent this link will fail so any translation and rotation of the camera has to be applied to the camera parent node.

In VR, as in reality, it is important to avoid tight spaces so the user doesn’t feel claustrophobic. The original Ice Cave was built with this in mind and provides ample space for the user.

The only effect not ported to VR was the dirty lens effect. In the original Ice Cave demo this effect is implemented as a quad that is rendered on top of the scene. A dirty texture appears with more or less intensity depending on how much the camera is aligned with the sun. This didn’t translate well to VR and so the decision was made to remove it completely from the VR version.


Figure 6. Dirty lens effect implemented in the original Ice Cave demo.

 

5. Extra features in the Ice Cave VR version

In the original demo the user can pass through the walls to look at the cave from the outside. However, in VR this didn’t create a good experience and the sensation of immersion disappeared when you went out of the cave. Instead, I implemented camera collision detection and smooth sliding for when the user moves very close to the walls.

When running a VR application on Samsung Gear VR, people around the user are naturally curious about what the user is actually seeing. We thought that it would be interesting, particularly for events, to stream the content from the VR headset to another device such as a tablet. We decided to explore the possibility of streaming just the camera position and orientation to a second device running a non-VR version of the same application.

The new Unity network API allowed rapid prototyping and in a few days I had an implementation which worked pretty well. The device actually running the VR version on the Samsung Gear VR works as a server and in each frame sends the camera position and orientation over wireless TCP to a second device that works as a client.


Figure 7. Streaming camera position and orientation from Samsung Gear VR to a second device.

 

Using the built-in touch pad to control the camera motion proved very successful. Nevertheless, we decided to provide the user with an alternative method of control using an external Bluetooth mini controller readily available elsewhere. This required us to write a plugin to extend the Unity functionality by intercepting the Android Bluetooth events and using them to trigger movement and resetting of the camera. Unfortunately there is not much information available, so whilst it was only possible to intercept the messages coming from two keys, this was enough to move/stop and reset the camera.


 

 

6. Conclusions

Ice Cave VR was implemented during my summer placement with ARM’s Ecosystem Demo team in less than eight weeks with no previous experience of Unity. This was possible thanks to the native VR integration Unity released on version 5.1. In principle, just a few steps are necessary to port a game to VR, although in practice you need to do some extra work to fine-tune the specific requirements of VR in your game. With this integration, Unity has greatly contributed to the democratisation of VR.

Unity VR integration is still in progress and some reported issues are expected to be solved in coming versions. Nonetheless, the Ice Cave VR version shows that it is possible to run high quality VR content on mobile devices if resources are balanced properly at runtime by using highly optimized rendering techniques.

All the advanced graphics techniques utilised in the Ice Cave demo are explained in detail in the ARM Guide for Unity Developers. In the guide you can find the source code or code snippets for these techniques, which allowed me to understand how they work.

What I consider most significant in all this is the fact that with mobile VR we are no longer limited to the size of our smartphones to enjoy a game. Now we can be part of a limitless virtual world and enjoy a wonderful VR experience from a tiny smartphone inserted in a headset. This really is an outstanding step forward!

 

Now you can check out the video

Mali Performance 7: Accelerating 2D rendering using OpenGL ES


It's been a while since my last performance blog, but one of those lunchtime coffee discussions about a blog I'd like to write was turned into a working tech demo by wasimabbas overnight, so thanks to him for giving me the kick needed to pick up the digital quill again. This time around I look at 2D rendering, and what OpenGL ES can do to help ...

 

A significant amount of mobile content today is still 2D gaming or 2D user-interface applications, in which the applications render layers of sprites or UI elements to build up what is on screen. Nearly all of these applications are actually using OpenGL ES to perform the rendering, but few applications actually leverage the 3D functionality available, preferring the simplicity of a traditional back-to-front algorithm using blending to handle alpha transparencies.

 

[Figure: a square shield sprite blitted onto a background - the outer region is fully transparent, the border partially transparent, and the centre opaque]

 

This approach works, but doesn’t make any use of the 3D features of the hardware, and so in many cases makes the GPU work a lot harder than it needs to. The impacts of this will vary from poor performance to reduced battery life, depending on the GPU involved, and these impacts are amplified by the trend towards higher screen resolutions in mobile devices. This blog looks at some simple changes to sprite rendering engines which can make the rendering significantly faster, and more energy efficient, by leveraging some of the tools which a 3D rendering API provides.

 

Performance inefficiency of 2D content

 

In 2D games the OpenGL ES fragment shaders are usually trivial – interpolate a texture coordinate, load a texture sample, blend to the framebuffer – so there isn’t much there to optimize. Any performance optimization for this type of content is therefore mostly about finding ways to remove redundant work completely, so that the shader never even runs for some of the fragments.

 

The figure in the introduction section shows a typical blit of a square sprite onto a background layer; the outer parts of the shield sprite are transparent, the border region is partially transparent so it fades cleanly into the background without any aliasing artifacts, and the body of the sprite is opaque. These sprite blits are rendered on top of what is in the framebuffer in a back-to-front render order, with alpha blending enabled.

 

There are two main sources of inefficiency here:

 

  • Firstly, the substantial outer region around this sprite is totally transparent, and so has no impact on the output rendering at all, but takes time to process.
  • Secondly, the middle part of the sprite is totally opaque, completely obscuring many background pixels underneath it. The graphics driver cannot know ahead of time that the background will be obscured, so these background fragments have to be rendered by the GPU, wasting processing cycles and nanojoules of energy rendering something which is not usefully contributing to the final scene.

 

This is a relatively synthetic example with only a single layer of overdraw, but we see real applications where over half of all of the rendered fragments of a 1080p screen are redundant. If applications can use OpenGL ES in a different way to remove this redundancy then the GPU could render the applications faster, or use the performance headroom created to reduce the clock rate and operating voltage, and thus save a substantial amount of energy. Either of these outcomes sounds very appealing, so the question is how can application developers achieve this?

 

Test scene

 

For this blog we will be rendering a simple test scene consisting of a cover-flow style arrangement of the shield icon above, but the technique will work for any sprite set with opaque regions. Our test scene render looks like this:

 

[Figure: the test scene - a cover-flow style arrangement of shield sprites over a background]

 

… where each shield icon is actually a square sprite using alpha transparencies to hide the pieces which are not visible.

 

Tools of the trade

 

In traditional dedicated 2D rendering hardware there are not usually many options to play with; the application has to render the sprite layers from back to front to make sure blending functions correctly. In our case the applications are using a 3D API to render a 2D scene, so the question becomes what additional tools does the 3D API give the applications which can be used to remove redundant work?

 

The principal tool used in full 3D scene rendering to remove redundant work is the depth test. Every vertex in a triangle has a “Z” component in its position, which is emitted from the vertex shader. This Z value encodes how close that vertex is to the camera, and the rasterization process will interpolate the vertex values to assign a depth to each fragment which may need fragment shading.  This fragment depth value can be tested against the existing value stored in the depth buffer, and if it is not closer [1] to the camera than the current data already in the framebuffer then the GPU will discard the fragment without ever submitting it to the shader core for processing, as it now safely knows that it is not needed.

 

Using depth testing in “2D” rendering

 

Sprite rendering engines already track the layering of each sprite so that they stack correctly, so we can map this layer number to a Z coordinate value assigned to the vertices of each sprite which is sent to the GPU, and actually render our scene as if it has 3D depth. If we then use a framebuffer with a depth attachment, enable depth writes, and render the sprites and background image in front-to-back order (i.e. the reverse of the normal blitting pass, which is back-to-front), then the depth test will remove parts of sprites and the background which are hidden behind other sprites.
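A hedged sketch of the GL state and the layer-to-depth mapping for this front-to-back pass is shown below; the layer count and the per-sprite draw call are placeholders standing in for whatever a real sprite engine provides.

    /* Sketch of the front-to-back pass with depth testing and depth writes.
     * MAX_LAYERS and the per-sprite draw call are placeholders. */
    #include <GLES2/gl2.h>

    #define MAX_LAYERS 256

    /* Map a sprite's layer index to a depth value: the background (layer 0)
     * sits just inside the far plane and higher layers move closer to the
     * camera, so nothing lands exactly on the cleared depth of 1.0. */
    static GLfloat layer_to_depth(int layer)
    {
        return 1.0f - ((GLfloat)(layer + 1) / (GLfloat)(MAX_LAYERS + 1));
    }

    static void render_front_to_back(void)
    {
        glEnable(GL_DEPTH_TEST);
        glDepthFunc(GL_LESS);
        glDepthMask(GL_TRUE);    /* this pass writes depth */

        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

        /* Submit sprites front-to-back (highest layer first) so anything
         * hidden behind something already drawn is rejected by the early
         * depth test before its fragments are shaded:
         *
         * for each sprite, highest layer first:
         *     draw_sprite(sprite, layer_to_depth(sprite->layer));
         */
    }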

 

If we run this for our simple test scene, we get:

 

[Figure: the depth-tested render using square sprite geometry - regions behind the transparent parts of closer sprites are missing, showing only the clear color]

 

Uh oh! Something has gone wrong.

 

The issue here is that our square sprite geometry does not exactly match the shape of opaque pixels. The transparent parts of the sprites closer to the camera are not producing any color values due to the alpha test, but are still setting a depth value. When the sprites on a lower layer are rendered the depth testing means that the pieces which should be visible underneath the transparent parts of an earlier sprite are getting incorrectly killed, and so only the OpenGL ES clear color is showing.

 

Sprite geometry

 

To fix this issue we need to invest some time into setting up some more useful geometry for our sprites. We can only safely set the depth value when rendering front-to-back for the pixels which are totally opaque in our sprite, so the sprite atlas generation needs to provide two sets of geometry for each sprite. One set, indicated by the green area in the middle image below, covers only the opaque geometry, and the second, indicated by the green area in the right image below, picks up everything else unless it is totally transparent (in which case it can be dropped completely).

 

[Figure: sprite geometry sets - the original sprite (left), the opaque region shown in green (middle), and the transparent region shown in green (right)]

 

Vertices are relatively expensive, so use as little additional geometry as possible when generating these geometry sets. The opaque region must only contain totally opaque pixels, but the transparent region can safely contain opaque pixels and totally transparent pixels without side-effects, so use rough approximations for a "good fit" without trying to get "best fit". Note for some sprites it isn’t worth generating the opaque region at all (there may be no opaque texels, or the area involved may be small), so some sprites may consist of only a single region rendered as a transparent render. As a rule of thumb, if your opaque region is smaller than 256 pixels it probably isn't worth bothering with the additional geometry complexity, but as always it's worth trying and seeing.

 

Generating this geometry can be relatively fiddly, but sprite texture atlases are normally static so this can be done offline as part of the application content authoring process, and does not need to be done live on the platform at run time.

 

Draw algorithm

 

With the two geometry sets for each sprite we can now render the optimized version of our test scene. First render all of the opaque sprite regions and the background from front-to-back, with depth testing and depth writes enabled. This results in the output below:

 

[Figure: output of the front-to-back opaque pass - only the opaque sprite regions and the visible parts of the background are shaded]

 

Any area where one sprite or the background is hidden underneath another sprite is rendering work which has been saved, as that is an area which has been removed by the early depth test before shading has occurred.

 

Having rendered all of the opaque geometry we can now render the transparent region for each sprite in a back-to-front order. Leave depth testing turned on, so that sprites on a lower layer don't overwrite an opaque region from a sprite in a logically higher layer which has already been rendered, but disable depth buffer writes to save a little bit of power.
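The state changes for this second pass are small; the sketch below (with placeholder draw calls, continuing the earlier layer_to_depth mapping) shows the transparent pass that follows the opaque pass.

    /* Sketch of the back-to-front transparent pass: depth testing stays on so
     * lower layers can't draw over opaque regions already rendered closer to
     * the camera, but depth writes are off and normal alpha blending is used.
     * The per-sprite draw call is a placeholder. */
    #include <GLES2/gl2.h>

    static void render_transparent_pass(void)
    {
        glEnable(GL_DEPTH_TEST);  /* still reject fragments behind opaque regions */
        glDepthMask(GL_FALSE);    /* ...but don't write depth in this pass        */
        glEnable(GL_BLEND);
        glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);

        /* Submit the transparent regions back-to-front (lowest layer first)
         * so that blending composites correctly:
         *
         * for each sprite, lowest layer first:
         *     draw_transparent_region(sprite, layer_to_depth(sprite->layer));
         */
    }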

 

If we clear the color output of the opaque stage, but keep its depth values, and then draw the transparent pass, we can visualize the additional rendering added by this pass. This is shown in the figure below:

 

[Figure: the transparent pass in isolation - partial outer rings show where the depth test has removed work]

 

Any area where one of the outer rings is partial indicates an area where work has been saved, as the missing part has been removed by the depth test using the depth value of an opaque sprite region closer to the camera which we rendered in the first drawing pass.

 

If we put it all together and render both passes to the same image then we arrive back at the same visual output as the original back-to-front render:

 

[Figure: the combined result of both passes - identical to the original back-to-front render]

 

... but achieve that with around 35% fewer fragment threads started, which should translate approximately to a 35% drop in MHz required to render this scene. Success!

 

The final bit of operational logic needed is to ensure that the depth buffer we have added to the scene is not written back to memory. If your application is rendering directly to the EGL window surface then there is nothing to do here, as depth is implicitly discarded for window surfaces automatically, but if your engine is rendering to an off-screen FBO ensure that you add a call to glInvalidateFramebuffer()  (OpenGL ES 3.0 or newer) or glDiscardFramebufferEXT() (OpenGL ES 2.0) before changing the FBO binding away from the offscreen target. See Mali Performance 2: How to Correctly Handle Framebuffers for more details.
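For an off-screen FBO the call sequence might look like the sketch below (OpenGL ES 3.0 entry point shown; the wrapper function name is illustrative).

    /* Sketch: discard the transient depth attachment of an off-screen FBO
     * before unbinding it, so its contents are never written back to memory.
     * On OpenGL ES 2.0, call glDiscardFramebufferEXT() from
     * GL_EXT_discard_framebuffer with the same arguments instead. */
    #include <GLES3/gl3.h>

    static void discard_offscreen_depth(GLuint offscreen_fbo)
    {
        const GLenum attachments[] = { GL_DEPTH_ATTACHMENT };

        glBindFramebuffer(GL_FRAMEBUFFER, offscreen_fbo);
        glInvalidateFramebuffer(GL_FRAMEBUFFER, 1, attachments);

        /* Only now move the FBO binding away from the off-screen target. */
        glBindFramebuffer(GL_FRAMEBUFFER, 0);
    }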

 

Summary

 

In this blog we have looked at how the use of depth testing and depth-aware sprite techniques can be used to accelerate rendering using 3D graphics hardware significantly.

 

Adding additional geometry to provide the partition between opaque and transparent regions of each sprite does add complexity, so care must be taken to minimize the number of vertices for each sprite, otherwise the costs of additional vertex processing and small triangle sizes will outweigh the benefits. For cases where the additional geometry required is too complicated, or the screen region covered is too small, simply omit the opaque geometry and render the whole sprite as transparent.

 

It’s worth noting that this technique can also be used when rendering the 2D UI elements in 3D games. Render the opaque parts of the UI with a depth very close to the near clip plane before rendering the 3D scene, then render the 3D scene as normal (any parts behind the opaque UI elements will be skipped), and finally the remaining transparent parts of the UI can be rendered and blended on top of the 3D output. To ensure that the 3D geometry does not intersect the UI elements, glDepthRange() can be used to slightly limit the range of depth values emitted by the 3D pass, guaranteeing that the UI elements are always closer to the near clip plane than the 3D rendering.
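A minimal sketch of that frame structure, with an arbitrarily chosen depth range split and placeholder draw helpers, might look like this:

    /* Sketch of reserving a thin slice of the depth range for the UI so the
     * 3D scene can never intersect it. The 0.01 split and the draw helpers
     * are illustrative placeholders. */
    #include <GLES2/gl2.h>

    static void render_frame_with_ui(void)
    {
        /* 1. Opaque UI first, pinned very close to the near clip plane. */
        glDepthRangef(0.0f, 0.01f);
        /* draw_opaque_ui(); */

        /* 2. The 3D scene, restricted to the rest of the range so it always
         *    resolves as further away than the UI elements. */
        glDepthRangef(0.01f, 1.0f);
        /* draw_3d_scene(); */

        /* 3. Finally blend the transparent UI parts over the 3D output. */
        glDepthRangef(0.0f, 0.01f);
        /* draw_transparent_ui(); */
    }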

 

Tune in next time,

Pete

 


[1] Other depth test functions are possible in OpenGL ES, but this is the common usage which is analogous to the natural real world behaviour.

 


 

Pete Harris is the lead engineer for the Mali GPU performance analysis team at ARM. He enjoys spending his time at a whiteboard determining how to get the best out of complex graphics sub-systems, and how to make ARM Mali GPUs even better.

Is Virtual Reality the Next Big Thing…Again?


In this blog series we’ll be looking at the current status of Virtual Reality with a special focus on mobile VR, where it’s heading in the future and what ARM® can do to help us get there. We’ll be covering the common challenges and pitfalls and how these can be avoided to produce a great VR experience. We’ll also keep you up to date with other blogs and discussions around technical tips, tutorials and FAQ’s; so stay tuned for the future of mobile VR.

 

Where VR is now

Virtual Reality isn’t new, people have been talking about it since the 90s, so why has the industry never quite taken off in the way we might expect? The quick answer is that the technology simply wasn’t there. The hardware was prohibitively expensive and very bulky and the graphics capabilities were too limited to produce a successful VR user experience - unless you consider motion sickness a success. Now however, lower cost hardware based on existing platforms is changing the game, with mobile platforms offering console-like performance. Not only that, but existing mobile devices already contain many of the sensors VR requires, from gyros to accelerometers, opening up a whole world of mobile VR possibilities.

 

What’s next for VR

The Virtual Reality industry is forecast to be worth US$30 billion by 2020, and all of that has to come from somewhere.

Fig.1 Digi-Capital VR Revenue Forecast

 

Gaming is of course a huge industry and a high-end, immersive gamer experience can now be literally at your fingertips. Mobile VR allows you to become fully involved in your chosen game at home, work, or while trying to escape the monotony of public transport; but that’s not all VR can do. Researching a university assignment can be a chore, but how about if you could visit the most relevant museums or seminars without having to leave the dorm? VR allows us to see exhibitions in world class museums and galleries without an expensive trip to London, Paris, or anywhere else. Shopping too, isn’t everyone’s favourite pastime, especially around the Christmas rush. Wouldn’t it be great if you could wander the aisles and compare options for your next car, sofa, TV or even pair of shoes, without tripping over pushchairs or being upsold by pushy assistants? All this is possible with the huge technical advances in VR and it’s only a matter of time until this is our standard way of working.

 


Fig.2 nDreams® Perfect Beach experience allows you to get away from it all without leaving the sofa

 

So how does VR actually work?

Technology is the key to VR success and this blog series will talk about exactly what you need to make it happen. VR comes in mobile or desktop options, but according to Oculus® Co-founder Palmer Luckey, desktop VR is seriously compromised by the requirement for a ‘cable servant’ to follow the user around preventing trip hazards. So mobile VR is the quickest way forward, and the simplest of the mobile options allows you to simply slot your smartphone into the headset and get started. The headset provides you with a stereoscopic display, with two marginally different images rendered for the left and right eye, allowing the user to experience depth. Barrel distortion is then applied to the rendered images in post processing to counteract the curvature of the lenses.

 


Fig.3 Marginally different images for each eye allow the perception of depth and barrel distortion applies curvature to the image to counteract the curvature of the lens
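As a rough illustration of the kind of correction involved, barrel distortion is often modelled as a radial polynomial applied to each lens-centred coordinate; the coefficients below are purely illustrative and real headsets calibrate them to their specific lenses.

    /* Hedged sketch of a simple radial barrel distortion model. k1 and k2 are
     * illustrative coefficients only; uv is a coordinate relative to the lens
     * centre, roughly in the range [-1, 1]. */
    typedef struct { float x, y; } Vec2;

    static Vec2 barrel_distort(Vec2 uv, float k1, float k2)
    {
        float r2 = uv.x * uv.x + uv.y * uv.y;        /* squared radius        */
        float scale = 1.0f + k1 * r2 + k2 * r2 * r2; /* radial scaling factor */

        Vec2 out = { uv.x * scale, uv.y * scale };
        return out;
    }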

 

Finally, sensors in the device detect the movement of your head and adjust the scene in real time to render the updated view to the headset and allow realistic visual feedback. Going forward, additional sensors will facilitate live hand-tracking for a truly immersive experience, and this can be combined with the use of an inbuilt or add-on controller to allow you to interact fully with your virtual surroundings.

 

VR Optimisation with Mali GPUs

As with any emerging technology, there are issues that can stand in the way of a truly successful VR user experience. These include low resolution blurring the image and compromising visual quality, or a low frame rate making the display appear stilted or jerky. A major issue experienced when developing for VR is latency, or the time it takes for the on-screen image to catch up with the user’s head movement, and this is one of the key causes of sickness or dizziness in VR users.

The ARM® Mali™ GPU family is the world’s #1 licensable GPU in terms of shipments and is perfectly positioned to deliver an optimum VR experience. The Mali GPU architecture enables high resolutions and power savings through features such as Adaptive Scalable Texture Compression (ASTC), while ARM Frame Buffer Compression (AFBC) dramatically reduces system bandwidth, and performance scales fully across multiple cores. Mali support for extensions to OpenGL ES and EGL reduces latency and improves overall performance.

 

What we’re doing now

At events like VRTGO, ARM recently demonstrated how great a mobile VR experience can be with the Mali-based Samsung® Gear VR headset, a collaboration between Samsung Mobile and Oculus. The first version was based on the Galaxy Note 4, with the second generation now available for the Galaxy S6, both powered by the Mali-T760. The Ice Cave demo, featuring Geomerics Enlighten global illumination in collaboration with RealtimeUK, was easily ported to VR on the Samsung Gear VR headset; read about how we did this here.


Superior visuals and a smooth user experience are all possible in mobile VR and throughout this blog series we’ll be discussing the common challenges surrounding developing for VR and how ARM’s technology and Mali GPUs can help you overcome them.

 

Stay tuned for more on mobile VR!

Virtual Reality: A Focus on Focus


As discussed in our virtual reality (VR) intro blog, whilst the VR industry is moving from strength to strength, there are still a few key issues which often stand in the way of developing a truly successful VR experience. For the purposes of this blog series we’ll be focussing on what we consider to be some of the main blockers. We’ll discuss the best ways to overcome these with technology and techniques that are currently available, as well as considering what we foresee in the future of the field.

So, to focus on focus. Our eyes are complicated objects and work very precisely to ensure the images we see are processed clearly and accurately (unless you’re a secret spectacle-wearer like me, in which case they’re letting the side down to be honest). The eye works in a similar way to a magnifying glass by concentrating light through the cornea and lens and directing it to the retina. When an image is in focus, all paths of light from a single pixel travel to a single point on the retina, allowing us to see it clearly. When an image is out of focus however, that same pixel travels to different parts of the eye, causing the image to appear blurry.


Light paths reaching the eye in focus


Light paths reaching the eye out of focus

 

This is a particular issue in current mobile VR systems as the images you see are all the same distance from your eye, namely just a few centimetres away on the screen of your smartphone. The images remain at this focal depth even if the differences between what is shown to each eye tell us that the image is at a different depth. This conflict between the depth from the stereo image (vergence) and the apparent focal depth (accommodation) makes it difficult for our brains to adjust. This can lead to visual discomfort, headaches and nausea.

 

The impact on VR

The simplest type of video content produced for VR is monoscopic, using what amounts to a single ‘360°’ camera. In practice this single image will likely use multiple cameras to cover the entire field of view and stitch the separate images together retrospectively. This is the cheapest and easiest method of 360° video production but doesn’t provide any of that vital depth information. A more complicated, but arguably better, approach is to use stereoscopic video which is produced by capturing two 360° images, one for each eye, providing depth perception from the stereo image. This method is more difficult to get right due to the complications of stitching the images for each eye together accurately. Capturing enough information to reproduce the focal depth as well as a stereo image is more complicated still, but advances are being made every day.

Arguably, the most successful way of addressing this issue of focus currently is with the effective use of light field displays. A ‘light field’ is how we describe all the light travelling through a region of space. A simple way to think about a light field display is to think about a regular display that could show you a different picture depending on the direction from which you viewed it. Although it may not be obvious, light field displays allow us to show images that have different focal depths. Objects further away from the viewer would create a noticeably different light field on the display than the same objects closer up. This requires the eye to adjust its focus to see them clearly, removing the brain-confusing mismatch between the focal and stereo depths.

 

Micro-lens Arrays

One way of creating a light field display to improve focus in VR is with the use of a micro-lens array, which is an overlaid transparent sheet with huge numbers of tiny, raised lenses.


Micro-lens array

 

Each micro-lens covers a small number of pixels of a regular display. These are beginning to emerge as technologies for wearables such as smartwatches, and the image you see changes depending on which way you view it, a bit like an advanced version of the lenticular images you get in breakfast cereal boxes. However, this micro-lens method forces a trade-off against resolution as it effectively turns multiple pixels into one.

Fig4 Micro-lens display.jpg

Micro-lens display

 

Micro-lens arrays are also reported to be complicated to produce at present, so we’ll also consider an alternative option.

The Light Field Stereoscope

To take advantage of the depth benefits of stereoscopy, multi-layer displays are currently being researched, where multiple display panels are layered with small gaps separating them. The eye sees each panel at a different focal distance and so by carefully crafting the content displayed on each layer, the display can place images at an appropriate depth. At SIGGRAPH 2015, Stanford University’s Huang et al presented the ‘Light field Stereoscope’ in which two LCD panels are backlit and placed one behind the other in the VR headset, with a spacer between. This allows us to project the background, or distant, images on the rear screen while the front screen displays images in the foreground of the scene you’re viewing. Distances in the middle of this range can be depicted by displaying partial images on each. This approximate light field adds some focal depth to the display, with objects in the foreground occluding those further back. The interaction of the two 2D displays is not the same as a true 4D light field but may well be sufficient. However, while there isn’t the same resolution trade off that we saw with the micro-lens approach, the front LCD panel acts as a diffuser and can therefore introduce some blurring.

Fig5 Light Field Display.jpg

Huang et al’s Light Field Stereoscope

 

What comes next?

To accompany light field displays, in recent years we have seen the emergence of light field cameras such as Lytro, which capture the full 4D light field across a camera aperture. These computational cameras let you produce light field images that can be refocused, or have their viewpoint shifted, after capture, opening up all kinds of creative possibilities. These cameras are also ostensibly better than your average camera as they typically capture more light. Next-generation 360° light field cameras promise to take this further and extend the focus and viewpoint-independent properties of light field images to VR. This bodes well for the future of 360° light field video playback on mobile VR devices. It will allow the user to explore the whole scene with freedom of head movement and natural depth perception, all within a VR headset. Another emerging area in VR is positional tracking on mobile, which will allow real-time response of the images to the actual physical location of the user’s head, a vital point for achieving a truly immersive experience and something we’ll be considering in more depth in the future.

 

Follow me on the ARM Connected Community and Twitter to make sure you don’t miss out on the next post in our VR series!


Mali-T880 is set to Deliver the Premium Mobile Experience of 2016

$
0
0

ARM®’s newest high performance GPU, the Mali™-T880, delivers better than ever graphics performance whilst remaining well within the energy constraints of a mobile platform. The exciting possibilities this opens up are beginning to be realised with the recent announcements of the Mali-T880 appearing in key devices by two of ARM’s partners.

 

As graphics content for mobile devices becomes increasingly complex, it places greater and greater pressure on the computational capabilities of GPUs. For a start, consumer expectations are that mobile gaming will provide the same high quality user experience as that available on console. The recent surge in virtual reality content, which is expected to make the biggest impact in mobile rather than desktop, also means that high capability GPUs are essential for all levels of mobile devices. Even mid-range smartphones these days are required to perform increasingly challenging computational procedures at low cost, so the next generation of high end mobile devices to hit the market are facing ever higher requirements for superior graphics capability. One of the key requirements is to be able to deliver high-end mobile gaming and applications, whilst still constrained by the power (and therefore thermal) limits of mobile devices.

 

The ARM Mali family of GPUs is perfectly positioned to address the needs of the whole mobile industry from entry level to top of the line, next-generation products. Based on the Mali Midgard architecture, the Mali-T880 is scalable from one to sixteen cores. It features an additional arithmetic pipeline per shader core in order to handle greater computational throughput than its predecessors and is up to 40% more energy efficient. Not only that, but the adoption of bandwidth saving technologies such as ARM Frame Buffer Compression (AFBC), Smart composition and Transaction elimination all allow for the Mali-T880 to be extremely efficient on memory bandwidth.

T880_Detail_Diagram1.jpg

ARM Mali-T880 GPU Structure

 

Two of ARM’s partners have recently released their take on the Mali-T880: Samsung and Huawei have both taken advantage of the exceptional performance and leading energy efficiency of the Mali-T880 to fulfil these very current needs. Huawei’s Mate 8 smartphone, announced last week, is based on the HiSilicon Kirin 950 chipset and contains an optimal combination of ARM Cortex-A72 and Cortex-A53 CPUs and the Mali-T880 GPU to provide a harmonized suite of efficient computing power. Samsung’s Exynos 8 Octa uses the Mali-T880 to ensure the highest quality user interface, immersive gaming and virtual reality content.

Mate8s.png

Huawei’s New Mate 8 Smartphone

 

2016 is set to feature yet another step forward in high end devices and with the launch of the Samsung Gear VR, consumer expectations for mobile graphics continue to grow. As more and more of ARM’s partners begin to utilise the Mali-T880, keep an eye out for the next generation of devices hitting the market in the new year.

 

Stay tuned for more news on Mali and Premium mobile, there are heaps of exciting developments to come.

Shedding Light on Global Illumination Techniques

$
0
0

This blog is based on a recent article written by Geomerics for the publication Making Games. For more information on any of the techniques outlined in this blog, read more here.

 

Lighting is the most fundamental element of any game environment. Applied correctly, it brings three dimensional geometry to life, amplifies the emotions of a scene and draws the eyes to important areas of focus. Global illumination, the interplay of lights and materials, plays a significant role in creating this balanced scene lighting.

   Direct vs Combined.png

Image 1: The left image shows direct lighting only; the right image shows the difference the addition of global illumination makes.

 

It is vital that games industry members understand the variety of techniques available and adopt the one that is most suitable for their platform, environment, gameplay and performance budget. In this article we’ll explore a variety of options available, weigh up their advantages and limitations and conclude with recommendations for different scenarios.

 

1. Baked Lighting

Baked lighting is an extremely common technique used for both direct and indirect lighting information as well as other effects such as ambient occlusion. It fully pre-calculates lighting information for static scenes and stores the output in static data structures which are then consumed by fragment shaders. The most commonly used data structures are textures, spherical harmonics probes and cubemaps.

 

Baking can produce high quality results and is among the most affordable runtime techniques, enabling performance efficient indirect lighting even on mobile devices.  However, everything is static at runtime both in the game and in the editor. Light sources, materials and meshes can’t be changed at runtime. The only option for dynamic content is to swap out entire data sets, which is very costly with respect to memory consumption as well as creating and managing this additional content. This either limits dynamic gameplay options or requires sacrifices with respect to visual quality or dynamic content.

 

In addition, baked lighting has the lengthiest workflow of all the global illumination solutions listed in this article. In order to see the effect of any changes made to a scene, no matter how small, artists need to run a full lighting bake. Given that in order to get a scene to AAA quality an artist will need to iterate on each individual light many times, baking can have a significant impact on a studio’s bottom line as well as the quality of the final result.

 

2. Light Propagation Volumes

The technique for light propagation volumes was proposed by Kaplanyan and Dachsbacher in 2010 as an alternative to traditional baked lighting and first implemented in CryEngine 3. The core idea behind this technique is to inject virtual point lights into a three-dimensional data structure, aggregate the injected lights for each cell, and finally iteratively propagate them to neighbouring cells.

 

Light propagation volumes allow lights, materials and objects to be fully dynamic at runtime. They work well with complex geometry and are relatively stable and flicker-free. At basic quality levels the technique can even scale down to mobile platforms. In addition, light propagation volumes can theoretically support multiple bounces, glossy reflections and participating media; however, in reality these extensions are considered too expensive for practical applications and the computational resources are rather used to improve the base lighting quality.

Isotropic participating media and glossy reflections.jpg

Image 2: Light propagation volumes can support participating media and glossy reflections

 

The technique has a number of issues that limit its application, the biggest being their low spatial resolution due to memory constraints even when employing cascades. Due to this, light and shadow leaking are a common occurrence and the lighting result is fairly ambient with low contrast. Combined with the fact that runtime performance scales with the number of light sources, light propagation volumes are mostly used for outdoor scenes with only a single directional light source.

 

3. Voxel Cone Tracing

While light propagation volumes calculate lighting for each cell and apply the indirect lighting result by a simple lookup into a volume texture, voxel cone tracing only calculates lighting in areas where there is actual geometry while performing a much more involved aggregation of the indirect lighting in the fragment shader.

LPV vs VCT.jpg

Image 3: Comparison of light propagation volumes and voxel cone tracing

 

Voxel cone tracing provides an elegant light propagation and aggregation model for both indirect lighting as well as specular reflections. It is able to handle complex geometry including transparency and participating media while delivering smooth lighting results at a higher quality than light propagation volumes. However, it is very expensive to achieve these quality levels and the technique cannot scale to more budgeted platforms. Voxel cone tracing still suffers from significant light leaking issues even at medium quality levels and the directional information required to significantly improve quality would further increase memory cost.

 

4. Pre-computed Visibility

Determining the visibility between surfaces is the most expensive part of calculating global illumination. By pre-computing the surface-to-surface visibility information, compressing this data and only at runtime combining it with material and direct lighting information, it is possible to provide high quality dynamic global illumination effects across all gaming platforms – even on mobile.

  Precompute.png

Image 4: This graph visualizes how pre-computed visibility global illumination works

 

Pre-computed visibility global illumination is similar to baked lighting in respect to storing its data output in the same three data structures: lightmaps, spherical harmonics probes and cubemaps. The difference is that precomputed visibility makes dynamic global illumination effects such as time of day, changing materials and player controlled lighting efficient enough to occur at runtime no matter the platform. Additionally, once the initial pre-computation is done, all light sources and materials can be changed in real-time in the editor and the resultant effect on the global illumination can be seen instantly. This allows artists to iterate over the lighting of a scene much more efficiently than with baked lighting, resulting in huge time-savings and often better visual quality.

 

The limitations to this technique include the requirement for enough geometry in the scene to be static to enable a precompute – certain environments, for example procedurally generated, completely dynamic or user-generated content, are not suitable. And while it is still much faster than traditional baked lighting, the pre-computation step may well be an adjustment for artists who are used to other techniques.

 

ARM’s Enlighten is the industry standard for pre-computed visibility global illumination. It is easily accessible as the default global illumination solution inside Unity 5 and licensable for Unreal Engine 3 and 4 as well as for internal engines.

 

Conclusion

The aforementioned techniques for computing global illumination can broadly be sorted into two categories:

 

Surface Techniques: Baked lighting, Pre-computed visibility

Volume Techniques: Light propagation volumes, Voxel cone tracing

 

Surface techniques compute indirect lighting based on the transfer of light between two-dimensional surfaces, whereas volume techniques rely on an underlying three-dimensional data structure. Surface techniques require different degrees of pre-computation and offer limited support for dynamic objects. However, they are affordable at runtime, don’t suffer from light or shadow leaking and are the only techniques that have proven scalability down to mobile platforms.

 

So if you are developing a game right now, which global illumination technique is best for you? The answer to this question depends strongly on the game environment and target platforms you are considering.

 

Provided that your scenes are static, baked lighting is a great option for affordable lighting that scales to mobile.

 

If your game consists largely of outdoor areas and can be lit by a single directional light source, light propagation volumes is a technique that you might want to consider.

 

For those titles which are intended only for high-end PCs, voxel cone tracing is an option. Yet, in many cases it is too expensive, especially with respect to memory consumption.

 

In all other cases, pre-computed visibility is a scalable, flexible and proven technique. If your game consists of enough static meshes to support the necessary surface-to-surface pre-computation, it offers the benefits of dynamism and rapid artist workflow along with the scalability enabled by baked lighting.

 

Enlighten.png

Mali Christmas, Popular content of 2015

$
0
0

Hello everyone,

 

We're all looking forward to Christmas so I wanted to keep this one short. For those who might be checking in over the holidays, here's some of the most popular content from the ARM Mali Graphics group this year, enjoy!

We hope you've found all our content this year to be useful and there'll be plenty more next year.

 

Missed the Graphics week back in April? Take some time to read the great blogs that came out of this week long Community event:

ARM Mali Graphics Week - April 13 -17, 2015

 

Virtual Reality was a hot topic this year, reflect with this series from freddijeffries:

Is Virtual Reality the Next Big Thing…Again?

Virtual Reality: A Focus on Focus

Virtual Reality: The Ice Cave

and learn about the Mali GPUs set to drive this innovation forward in 2016

Mali-T880 is set to Deliver the Premium Mobile Experience of 2016

 

Why not learn more about the ARM Mali demos and some of the techniques from these popular blogs:

Inside the Demo: GPU Particle Systems with ASTC 3D textures

Dynamic Soft Shadows Based on Local Cubemap

also worth a look are the Reflections Based on Local Cubemaps in Unity

From Hours to Milliseconds: Project Ice Cave

Achieving the icy wall effect in the Ice Cave demo

Learn to achieve great lighting and Global Illumination in your games with Enlighten by Geomerics

Shedding Light on Global Illumination Techniques

Forging the Future of Video Games with Enlighten 3

Lighting in Games

 

Looking for some projects over Christmas? Perhaps you've got a new Chromebook (User Space drivers available here), so why not get Linux set up with the guide below:

Linux on Chromebook with ARM® Mali™ GPU

 

Take a look at the ARM tools and getting them set-up with your device:

lorenzodalcol dived into the Mali Graphics debugger with a two part case study:

Mali GPU Tools: A Case Study, Part 1 — Profiling Epic Citadel

Mali GPU Tools: A Case Study, Part 2 — Frame Analysis with Mali Graphics Debugger

 

Don't forget about DS-5 Streamline. This tool will help you find where your application's bottlenecks are; get it set up on your ARM Mali device thanks to mcgeagh:

Using DS-5 Streamline with Mali on Samsung Galaxy Note 3 & 10.1

Using DS-5 Streamline with Mali on Samsung Galaxy Note 4

Using DS-5 Streamline with Mali on Samsung Galaxy SIII

Using DS-5 Streamline with Mali on Google Nexus 10

 

Working on performance on Mali in the new year? Learn from the expert with these series of blogs:

peterharris' series of Mali Performance blogs are a great read:

Mali Performance 1: Checking the Pipeline

Mali Performance 2: How to Correctly Handle Framebuffers

Mali Performance 3: Is EGL_BUFFER_PRESERVED a good thing?

Mali Performance 4: Principles of High Performance Rendering

Mali Performance 5: An Application's Performance Responsibilities

Mali Performance 6: Efficiently Updating Dynamic Resources

Mali Performance 7: Accelerating 2D rendering using OpenGL ES

 

Want to know more about the Mali GPU? Check out the below series also from peterharris:

The Mali GPU: An Abstract Machine, Part 1 - Frame Pipelining

The Mali GPU: An Abstract Machine, Part 2 - Tile-based Rendering

The Mali GPU: An Abstract Machine, Part 3 - The Shader Core

 

 

Merry Christmas,

Ryan

ARM enables power efficient, enhanced visual experiences on 2.5K and 4K mobile devices with the Mali-DP650 Display Processor

$
0
0

Yesterday ARM introduced its highest performing and most power efficient display processor technology, the ARM® Mali™-DP650, which targets next generation premium and mass market devices. 

 

The Mali-DP650 delivers performance and power efficiency gains of up to 2X compared to its predecessor, the Mali-DP550, in order to address the shift in mobile panel resolution from Full HD (1920x1080 pixels) to 2.5K formats including WQXGA (2560x1600 pixels) for tablets and WQHD (2560x1440 pixels) for a mobile screen. The Mali-DP550 remains the processor of choice for HD (1280x720 pixels) and Full HD devices.

 

Mali-DP650 offers enhanced 4K (3840x2160 pixels) capabilities and the ability to send 4K to a larger screen through Wi-Fi streaming (e.g. Miracast) or via a connector (such as HDMI or USB-C). It enables more display processor use cases at 4K, further reducing the requirement to engage other hardware blocks such as the GPU, thereby saving power and prolonging battery life.

 

Like its predecessor, the Mali-DP550, the Mali-DP650 offers an array of advanced features such as 7-layer composition, rotation, high quality scaling and a variety of pre and post-processing functions for video and graphics.

 


ARM Mali dp650 diagram

 

Mali-DP650: Performing functions such as composition, rotation, advanced scaling, colour conversion and tone mapping in a single pass through external memory 

 

 

 

Addressing the performance challenges as mobile panels move towards 4K

 

Previously, it was thought that Full HD video and graphics content offered sufficient detail for a mobile screen. However, there is no doubt that being able to display video, UI and gaming content at even higher resolutions makes the overall output sharper and clearer. This has already led to a visible market presence of mobile screens capable of delivering higher resolutions - WQHD, WQXGA and even 4K is now starting to appear. This, however, poses a trade-off as handling higher resolution display layers natively means a costlier SoC in terms of both size and power consumption. Yet demand for such SoCs is growing and so therefore is the demand for effective solutions to address this issue with minimum impact on area, bandwidth and power.

 

Mali-DP650 doubles the available AXI bus width to 128 bits and adds significant memory subsystem optimizations, which results in doubling the pixel throughput at the specified AXI clock rate. This exceeds performance expectations by being able to composite, process and display a combination of up to three full-frame 4K graphics and 10-bit video layers without any external intervention. This means that heavy composition use cases such as 4K video playback or gaming with UI overlays can be offloaded from the GPU onto the Mali-DP650 display processor, delivering power savings and allowing the GPU to focus on tasks it’s optimized for.

 

 

Blog2 (2).jpg

 

Mali-DP650 in a typical Mass Market Mobile system

 

 

 

Mali-DP650 also provides an MMU pre-fetcher solution to hide the delays caused by page table walks and enhances the overall interoperability with ARM’s CoreLink MMU-500 thus improving its overall tolerance to system latency. This translates to being able to build a lower cost interface to the display processor whilst still being able to deliver the same superior quality required by Full HD and beyond state-of-the-art mobile panels.

 

ARM Frame Buffer Compression (AFBC) is a system bandwidth saving technology which is known to deliver up to 50% system bandwidth saving if implemented throughout the multimedia system (GPU, Video and Display). Mali-DP650 implements AFBC and further optimizes AFBC efficiency by ensuring superior performance with DDR controllers.

 

Mali-DP650 display output is compatible with MIPI (Mobile Industry Processor Interface) DPHY transmitters.  The Mali-DP650 offers split-display mode which is necessary to be able to talk to MIPI DPHY mobile panels with screen resolutions beyond Full HD. Split display mode allows for an image to be split into two equal parts at the display output thereby allowing two separate MIPI DSI transmitters to be driven in parallel. This doubles the available data rate making display resolutions up to 4k a reality on mobile or tablet devices.  

 

Variable refresh rate saves power on the mobile panel

 

A display processor sends out pixel information to be displayed at a refresh rate that matches the rate at which the panel refreshes. In most systems today this is 60 times a second (60Hz/60fps). If the display processor did not send the information at the required rate then the panel would not be able to synchronize with the system and the visual output would not be displayed (or would be displayed with clear visual artefacts). Always sending output frames from the system to the panel at a fixed, and high, refresh rate has a key disadvantage. Even when the system is producing new visual information at a much slower rate (e.g. during video playback or when displaying a static webpage), the display processor is still running at the high refresh rate in order to keep the panel fed with information. In mobile, battery-powered systems this is detrimental to the battery life as the majority of the display subsystem must remain running continuously. To counter that, Mali-DP650 supports variable refresh - a technique whereby the output signal from the display processor is not transmitted at a constant and fixed rate but can vary on a per frame basis. The lowest frame rate will depend on the panel itself but a typical target would be 30Hz, which contributes to a longer battery life.

 

Enabling more composition use cases at resolutions beyond Full HD

 

Mali-DP650 can receive the 10-bit per pixel 4K video layer in either AFBC or YUV formats alongside additional UI layers (RGB888a or AFBC formats) and composite them with alpha blending. It can also rotate, downscale or upscale to the required resolution and even tone map the final output before sending it to the display. All of this functionality is performed in a single pass through the display processor without having to write back to external memory, thereby significantly reducing the overall system bandwidth and consequently, the power consumption. In addition, Mali-DP650 can be configured to act as a content passport for users as they move from a smaller mobile screen to a larger screen - 4K streaming video playback to external 4K displays via either HDMI, USB-C or through Wi-Fi transmission.

 

 

Blog3.jpg

 

Another common video use case on a mobile device is a mobile video call via IP. This requires simultaneous video streams, one from the local camera and one from the other device, to be composited as a picture-in-picture (PIP) on the mobile display. The Mali-DP650 display processor enables PIP on the mobile display by allowing composition for the two video layers along with an additional graphics UI overlay, and can also upscale or downscale to meet the requirements of the local mobile panel. This again provides system-wide power savings by enabling composition, rotation and scaling without having to engage the GPU.

 

 

Blog4.jpg

 

 

And it’s not just about the hardware…

 

Mali-DP650 is delivered with an Android HW Composer driver and an associated ADF-based Kernel Driver optimized to work alongside Mali GPU and Mali Video drivers. The ADF kernel device drivers are licensed under GPLv2 and are available in the driver section of the Mali Developer Centre. For more information on the Mali Display Android drivers, please check out the following blog: Do Androids have nightmares of botched system integrations?

 

In summary…

 

The Mali-DP650 display processor is 4K capable and optimised for 2.5K. It offers new features and performance enhancements that enable beyond Full HD mobile and tablet displays with optimal power usage. Mali-DP650 aims to drive new visual content and gaming experiences on mobile devices and act as a content passport for users as they move from a smaller mobile screen to a larger screen, such as a 4K smart television. Mali-DP650 further expands the Mali family of display processors, including Mali-DP550, which remains the processor of choice for mobile panels below Full HD. Display processors are critical for enabling the best overall user experience and prolonging the battery life of the device and we expect the Mali-DP650 to begin appearing in devices from early 2017.

Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 1

$
0
0

Read Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 2

Read Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 3

 

Introduction

 

This is the first article of three that will focus on the implementation of the Fast Fourier Transform (FFT) using the mixed-radix method on Mobile ARM® Mali™ GPU by means of OpenCL™.

This blog series continues the work of neiltan who analyzed the main strategies for optimizing the radix-2 FFT algorithm in his blog Optimizing Fast Fourier Transformation on ARM Mali GPUs.

In this first post, we are going to build up a minimal background for the 1D complex-to-complex FFT algorithm, starting by pointing out the limits of the direct DFT computation and introducing the well-known and commonly used radix-2 FFT.

The mathematical perspective of the mixed-radix FFT will be used to explain the key concepts of digit-reverse, twiddle factor multiplication and radix computation. These will be covered extensively in the second article from the point of view of an implementation in OpenCL.

 

Before starting, we'd like to thank Georgios Pinitas and Moritz Pflanzer for their important help in reviewing this article and for their contribution to this implementation.

 

 

Discrete Fourier Transform

 

The Discrete Fourier Transform (DFT) is a domain transform (time -> frequency, image -> frequency,..) widely used in real-time digital signal processing for a variety of tasks such as spectrum analysis and convolution in the frequency domain.

It is defined as follows:
dft.png

where:

  • x(n) where n ∈ [0, N-1] is the n-th element of the input complex data sequence uniformly sampled
  • X(k) where k ∈ [0, N-1] is the k-th element of the output complex DFT

 

The inverse function of the DFT, known as the Inverse Discrete Fourier Transform (IDFT), allows us to obtain the original function.

idft.png

Commonly the above expressions are written in the following tightly-packaged forms:

dft_twiddle.png

idft_twiddle.png

with

twiddle_factor.png

twiddle_conj.png
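For reference, since the expressions above are embedded as images, the standard forms they depict can be written out as:

X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j\frac{2\pi}{N}kn}, \qquad x(n) = \frac{1}{N}\sum_{k=0}^{N-1} X(k)\, e^{+j\frac{2\pi}{N}kn}

and, with the twiddle factor W_N = e^{-j\frac{2\pi}{N}}:

X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{kn}, \qquad x(n) = \frac{1}{N}\sum_{k=0}^{N-1} X(k)\, W_N^{-kn}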

convolution.png

   Convolution in the frequency domain corresponds to a simple pointwise multiplication

 

 

Analyzing the complexity of the direct computation of DFT

 

Generally the complexity of an algorithm can be defined in terms of the number of multiplications. We can easily infer the complexity of the direct computation by decomposing the N-point DFT into its real and imaginary parts:

decomposition_dft.png

where in the above relation we have applied the following trigonometric relation:

trigonometric.png

Equating the real and imaginary part:

real_part_dft.png

imag_part_dft.png

the direct computation would require:

  • 2 trigonometric multiplications for N iterations for the real part = 2*N
  • 2 trigonometric multiplications for N iterations for the imaginary part = 2*N

 

Since we have to compute K output complex values, we have:

  • 2 * N * K for the real part
  • 2 * N * K for the imaginary part

 

therefore the total multiplications are: 2*N^2 + 2*N^2 = 4 * N^2

 

giving the direct computation a complexity of O(N^2).
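To make this counting concrete, below is a minimal, unoptimized C sketch of the direct computation (for illustration only; it is not part of the OpenCL implementation discussed in this series). The two nested loops of N iterations each, with 4 multiplications per inner iteration, are exactly where the O(N^2) behaviour comes from.

#include <math.h>
#include <stddef.h>

/* Direct O(N^2) DFT of a complex sequence stored as separate
 * real/imaginary arrays, following the decomposition above. */
void dft_direct(const float *x_re, const float *x_im,
                float *X_re, float *X_im, size_t N)
{
    const float TWO_PI = 6.283185482025146484375f;

    for (size_t k = 0; k < N; ++k)          /* N output values...         */
    {
        float sum_re = 0.0f;
        float sum_im = 0.0f;

        for (size_t n = 0; n < N; ++n)      /* ...each needing N products */
        {
            float phi = -TWO_PI * (float)(k * n) / (float)N;

            sum_re += x_re[n] * cosf(phi) - x_im[n] * sinf(phi);
            sum_im += x_re[n] * sinf(phi) + x_im[n] * cosf(phi);
        }

        X_re[k] = sum_re;
        X_im[k] = sum_im;
    }
}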

 

Fast Fourier Transform - Cooley–Tukey algorithm

 

FFT is an efficient algorithm for producing exactly the same result (in the limit of precision) as DFT but with a complexity of O(N log2 N).

This method (and the general idea of FFT) was popularized by a publication of J. W. Cooley and J. W. Tukey in 1965, but it was later discovered that those two authors had independently re-invented an algorithm known to Carl Friedrich Gauss around 1805 (and subsequently rediscovered several times in limited forms).

https://en.wikipedia.org/wiki/Fast_Fourier_transform

The most famous and commonly used FFT algorithm is radix-2 Cooley–Tukey. It works only when N is a power of 2 and is built on a "divide and conquer" ("divide et impera") strategy which recursively breaks down the DFT into many smaller DFTs of size 2, called radix-2.

 

Essentially, it is as if we see N as composed of log2(N) factors of 2 (i.e. N = 2 x 2 x ...)

radix2_basic_element.png

radix2_n_8_2.png

Looking at the picture we have:

  • log2(N) radix-2 stages, each one composed of N/2 radix-2 basic elements
  • (log2(N) - 1) twiddle factor multiplication stages

 

Please note that after the first radix stage, the inputs of each radix basic element will not be in consecutive order. In particular there will be an offset called "span" between each input which will depend on the radix stage.

single_radix_element.png

 

 

From radix-2 to mixed-radix

 

Often for signal processing applications we design our systems with radix-2 FFTs in mind. However, requiring N to be a power of 2 so that it exactly fits radix-2 FFTs can cause a significant performance drop in some applications.

For instance, if we consider the worst case for a 1D FFT, forcing the use of radix-2 would require padding the input to double its size and consequently doubling the amount of computation. Moreover, this loss of performance is much worse in the 2D case: in image processing applications the dimensions are quite often not powers of 2, and doubling the size of both rows and columns increases the total FFT size by a factor of 4.

 

So far we have factorized N as N = 2 x 2 x 2 x ..., but any factorization is generally possible.

Factorizing N as N = N1 x N2 x N3 x ..., the "divide and conquer" strategy remains the same, but the algorithm now recursively breaks the DFT into many smaller DFTs of sizes N1, N2, N3, called respectively radix-N1, radix-N2, radix-N3, ...

 

The generalization of the basic radix-2 FFT is called mixed-radix.

 

In terms of pipeline stages we are going to have:

  • N / Ns DFTs of length Ns for each radix stage, where Ns is the radix order
  • Z radix stages, with Z equal to the number of factors used to compose the original length N.

7x5x3.png

 

From this first analysis we can see that factorization is the fundamental principle behind the Cooley–Tukey algorithm and that the radix-2 algorithm is just a special case of mixed-radix.

In principle mixed-radix allows us to compute the FFT for any input length N but, as typical implementations use a limited number of optimized radixes such as radix-2, radix-3, radix-5 and radix-7, the computation is restricted to those N that can be expressed as a product of the available factors. In our case we are going to use:

  • radix-2
  • radix-3
  • radix-4
  • radix-5
  • radix-7

 

so N must be a product of powers of 2, 3, 5 and 7.

compound.png

 

Please note: 4 does not appear in the factorization of N as it is itself a power of 2. The motivation behind the choice to implement a highly optimized radix-4 will become clear in the second article, where we will compare the performance of the radix-2 and mixed-radix algorithms.
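As an illustration of this constraint, here is a small host-side C sketch (a hypothetical helper, not taken from the actual implementation) that decomposes N into the supported radix orders, pulling out factors of 4 first so that the optimized radix-4 is used wherever possible:

/* Decompose N into the supported radix orders {4, 2, 3, 5, 7}.
 * Returns the number of stages written into 'stages', or 0 if N
 * contains a prime factor other than 2, 3, 5 or 7 (or if there is
 * not enough room in 'stages'). Illustrative sketch only. */
unsigned int factorize_radix_stages(unsigned int N, unsigned int *stages, unsigned int max_stages)
{
    static const unsigned int supported[] = { 4, 2, 3, 5, 7 };
    unsigned int n_stages = 0;

    for (unsigned int i = 0; i < sizeof(supported) / sizeof(supported[0]); ++i)
    {
        while ((N % supported[i]) == 0 && n_stages < max_stages)
        {
            stages[n_stages++] = supported[i];
            N /= supported[i];
        }
    }

    return (N == 1) ? n_stages : 0;
}

For example, N = 105 = 7 x 5 x 3 would be decomposed into the stages {3, 5, 7}, while a prime length such as N = 11 would be rejected.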

 

Theory behind mixed-radix

 

This section will detail the maths behind the Cooley–Tukey mixed-radix algorithm and will explain concepts such as twiddle factor multiplication and digit-reverse.

Since factorization is the key principle of the Cooley–Tukey mixed-radix algorithm, let's start by decomposing N into 2 factors, Nx and Ny:

nx_ny.png

Let's arrange the input data x(n) in a matrix with Nx columns and Ny rows where n-th element can be addressed in the following manner:

relation_n.png

with:

  • nx = n % Nx, scans the columns - nx ∈ [0, Nx - 1]
  • ny = floor(n / Nx), scans the rows - ny ∈ [0, N / Nx - 1]

i.e.

If we had N = 21 and we factorized N as N = 7 x 3, the 16-th element of the input array would be addressed:

  • ny = floor(16 / 7) = 2
  • nx = 16 % 7 = 2

 

address of 16-th element = 2 + 2 * 7

matrix_input.png

Using this notation, the DFT can be written as follows:

dft_nx_ny.png

Now let's place the output elements X(k) in a matrix as well but with Nx rows and Ny columns. The motivation behind this choice will be clear during the demonstration.

The k-th element will be addressed with the following relation:

relation_k.png

For addressing the output elements, we use kx and ky:

  • kx = floor(k / Ny), scans the rows - kx ∈ [0, N / Ny - 1]
  • ky = (k % Ny), scans the columns - ky ∈ [0, Ny - 1]

 

Note that the addressing of the output elements is flipped with respect to the input: ky scans the columns whilst kx scans the rows.

matrix_output.png

and the DFT becomes:

dft_nx_ny_2.png

Let's evaluate the complex exponential:

exponential.png

  • Term 1: Always 1, because it is a multiple of 2π
  • Term 4: A trigonometric complex constant (the twiddle factor)

 

The exponential is easy to study thanks to our choice of an output matrix with Nx rows and Ny columns.

 

Writing the 4th term as:

twiddle_factor_pre_stage.png

the final DFT's expression becomes:

final_expression_mixed_radix.png

This is the expression behind the computation of mixed-radix algorithm in the case of 2 factors.

 

If we had for instance N = 21 and we factorized N as N = 7 x 3, we could graphically summarize this final expression in the following manner:

transpose_fft_steps.png

This can be translated into the following pipeline:

twiddle_factor_multiplication_2stages_mult2.png

 

Summary

 

  1. If Nx and Ny were composite numbers, we could again apply the decomposition to Nx and Ny
  2. The outer sum in the final expression is the DFT of each row after first multiplying by a proper trigonometric complex constant called "twiddle factor"
  3. The final result does not change by swapping the order of the factors
  4. Looking at the picture we can easily guess that each stage of the pipeline is an embarrassingly parallel problem.

 

In the end from the demonstration we can highlight the 3 main computation blocks behind this algorithm:

  1. Digit-reverse stage in order to have the output complex values in the correct order
  2. Twiddle factor multiplication stage which multiplies each value by a proper complex constant before transmitting to the next radix stage
  3. Radix stage which computes a highly optimized DFT

 

With this first article we have covered the minimal background behind the mixed-radix FFT algorithm and we are ready to dive into the next article, where we are going to play with the OpenCL kernels.

 

Ciao,

Gian Marco

 

Read Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 2

Read Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 3

 

References

 

  1. https://community.arm.com/groups/arm-mali-graphics/blog/2014/05/30/fast-fourier-transformation-vectorization-on-arm-mali-gpus
  2. http://www.engineeringproductivitytools.com/stuff/T0001/PT07.HTM
  3. Fast Fourier Transform - Algorithms and Applications, K. R. Rao, Do Nyeon Kim, Jae Jeong Hwang
  4. The Scientist and Engineer's Guide to Digital Signal Processing, Steven W. Smith

Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 2

$
0
0

Back to Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 1

Read Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 3

 

Here we are for the second part of our blog series about the OpenCL™ implementation of the Complex to Complex Fast Fourier Transform based on the mixed-radix method on Mobile ARM® Mali™ GPUs.

Whilst in the first article - Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 1 - we presented the basic mathematical background, in the following we are going to play with the 3 main computation blocks behind the mixed-radix FFT, analyzing each one step by step from the point of view of both the theory and its efficient and simple implementation in OpenCL.

The development platform used to carry out our performance analysis will be the Firefly-RK3288 which is an all-in-one high performance multi-core computing platform based on the Rockchip RK3288 SoC featuring an ARM® Cortex®-A17 @1.8GHz and ARM® Mali™-T760 (Mali GPU Driver: r6p0-02rel0).

For all performance evaluations, DVFS has been disabled and the GPU clock has been set at the maximum frequency of 600MHz.

Further information about the development platform can be found at:

 

pararm-fire.png

 

Implementation

 

Let's take a look at how we store our complex values. The input and output data are stored in a floating point complex array with 2*N elements where the real and imaginary parts are placed alternately. Two different buffers with 2*N elements have been used to store the input and output values.

 

array.png

 

The FFT mixed-radix pipeline that we are going to study is composed mainly of 3 computation blocks:

  1. Digit-reverse
  2. Twiddle factor multiplication
  3. Radix computation

pipeline_separated.png

 

Indexing scheme

 

The only real complexity of FFT mixed-radix is the indexing scheme. The indexing scheme expresses the relation between the indices n and k of 2 generic stages X and Y. This relation is fundamental because it allows us both to know which inputs must be processed by each radix basic element and to correctly perform the digit-reverse and twiddle factor multiplication stages.

 

Given 2 generic stages X and Y:

  1. n addresses the values in the stage X: n = nx + ny * Nx
  2. k addresses the values in the stage Y: k = ky + kx * Ny

with:

  • nx = n % Nx, scans the columns - nx ∈ [0, Nx - 1]
  • ny = floor(n / Nx), scans the rows - ny ∈ [0, N / Nx - 1]
  • kx = floor(k / Ny), scans the rows - kx ∈ [0, N / Ny - 1]
  • ky = (k % Ny), scans the columns - ky ∈ [0, Ny - 1]

n_k.png

From the above picture we can easily see what nx, ny, kx and ky are. As for Nx and Ny: in the case of just 2 radix stages they are respectively the radix orders of stages X and Y. In the case of more than 2 radix stages, Ny is still the radix order of stage Y, but Nx becomes the product of the radix orders from the first radix stage up to stage X.

 

Nx is also known as the span or butterfly span. The span is the offset between the complex inputs of a radix basic element and it was introduced in the first article in reference to the radix-2 FFT algorithm.

nx_radix_s.png

i.e.

Let's assume we have N = 8 x 4 x 4 x 3 x 2 (5 stages).

If the stages X and Y were stage 2 and stage 3 (counting the stages from 0), Nx would be Nx = 8 x 4 x 4 = 128 and Ny = 3

 

With a pipeline made up of M radix stages, we would like to have a relation between the indices k and n of 2 generic stages; in particular we would like to have something like:

 

map_kn.png

mapping_function_nk.png

 

The general expressions for computing these mapping functions are:

relation_nk.png

n_to_k_relation.png

where Ni is:

ni.png

 

/**
 * @brief Map index k to index n
 *
 * @param[in] k  Index k
 * @param[in] Nx It is the span
 * @param[in] Ny It is the radix order of stage Y
 *
 * @return       Index n
 */
uint map_k_to_n(uint k, uint Nx, uint Ny)
{
    uint Ni = Nx * Ny;
    uint ky = k % Ny;    // Remainder
    uint kx = k / Ny;    // Integer part
    uint n = (kx % Nx) + (kx / Nx) * Ni + ky * Nx;

    return n;
}

 

/**
 * @brief Map index n to index k
 *
 * @param[in] n  Index n
 * @param[in] Nx It is the span
 * @param[in] Ny It is the radix order of stage Y
 *
 * @return       Index k
 */
uint map_n_to_k(uint n, uint Nx, uint Ny)
{
    uint Ni = Nx * Ny;
    uint k = (n * Ny) % Ni + (n / Nx) % Ny + Ni * (n / Ni);

    return k;
}
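As a quick host-side sanity check (illustrative only, assuming the two mapping functions above are in the same translation unit together with a typedef unsigned int uint;), we can verify that map_n_to_k() and map_k_to_n() are inverses of each other for the N = 21 = 7 x 3 example used in the first article:

#include <stdio.h>

int main(void)
{
    const uint N  = 21;
    const uint Nx = 7;   /* span: radix products up to stage X */
    const uint Ny = 3;   /* radix order of stage Y             */

    for (uint n = 0; n < N; ++n)
    {
        uint k = map_n_to_k(n, Nx, Ny);

        /* Mapping n -> k and back must return the original n */
        printf("n = %2u -> k = %2u -> n = %2u\n", n, k, map_k_to_n(k, Nx, Ny));
    }

    return 0;
}

For instance, the 16-th element maps to k = 8 and back to n = 16, consistent with the addressing example shown earlier.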

 

Every time we compute a radix stage, it is good to keep the span Nx updated.

 

/* Init Nx to 1 */
uint Nx = 1;

/* Scan each radix stage */
for(uint s = 0; s < n_stages; ++s)
{
    /* Get radix order of stage s */
    uint Ny = radix[s];

    /* Body for computing twiddle factor multiplication and radix computation */
    ...

    /* Update Nx */
    Nx *= Ny;
}

 

 

Digit-reverse

 

Once we have introduced the indexing scheme, we are ready to analyze the main computation blocks of our pipeline. Let's start with the digit-reverse.

The digit-reverse is the first stage of our pipeline which places the input elements in a specific order called "digit-reverse order".

The digit-reverse order would be exactly the order of the output elements if we left the input elements in linear order (0, 1, 2,...).

 

Since we know the relation between the indices n and k, a possible way of obtaining the digit-reverse order is to iterate the mapping function map_n_to_k() from the first to the last stage in order to know where the input index n would be mapped at the end.

 

digit_reverse.png

/**
 * @brief This function computes the digit reverse index for each input
 *
 * @param[in]  stage             It contains the radix order for each radix stage
 * @param[out] idx_digit_reverse It contains the digit-reverse order index
 * @param[in]  n_stages          Total number of radix stages
 * @param[in]  N                 Number of input
 */
void digit_reverse(float* stage, uint* idx_digit_reverse, uint n_stages, uint N)
{
    /* Scan elements */
    for(uint n = 0; n < N; ++n)
    {
        uint k = n;
        uint Nx = stage[0];

        /* Scan stages */
        for (uint s = 1; s < n_stages; ++s)
        {
            /* radix of stage s-th */
            uint Ny = stage[s];
            uint Ni = Ny * Nx;

            /* Update k index */
            k = (k * Ny) % Ni + (k / Nx) % Ny + Ni * (k / Ni);

            /* Update Nx */
            Nx *= Ny;
        }

        /* K is the index of digit-reverse */
        idx_digit_reverse[n] = k;
    }
}
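A small usage example (again illustrative only, assuming the function above and a uint typedef are in scope): for N = 6 factorized as 3 x 2, the computed digit-reverse indices come out as 0, 2, 4, 1, 3, 5.

#include <stdio.h>

int main(void)
{
    float stage[2] = { 3.0f, 2.0f };  /* radix orders: 3 then 2 */
    uint  idx[6];

    digit_reverse(stage, idx, 2, 6);

    for (uint n = 0; n < 6; ++n)
    {
        printf("n = %u -> digit-reverse index = %u\n", n, idx[n]);
    }

    return 0;
}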

 

Once we know the digit-reverse index, we can implement a CL kernel that places the input values in the required order.

 

/**
 * @brief This kernel stores the input in digit-reverse order
 *
 * @param[in]  input            It contains the input complex values in linear order
 * @param[out] output           It contains the output complex values in digit-reverse order
 * @param[in]  idx_bit_reverse  It contains the digit-reverse order index
 */
kernel void digit_reverse(global float2* input, global float2* output, global uint* idx_digit_reverse)
{
    /* Each work-item stores a single complex value */
    const uint n = get_global_id(0);

    /* Get digit-reverse index */
    const uint idx = idx_digit_reverse[n];

    /* Get complex value */
    float2 val = (float2)(input[idx]);

    /* Store complex value */
    output[n] = val;
}

 

 

The digit-reverse kernel uses 2 different cl_buffers:

  1. one for loading the input complex values
  2. one for storing the output complex values in digit-reverse order.

 

Twiddle Factor Multiplication

 

Twiddle factor multiplication is the intermediate stage between 2 generic radix stages X and Y. In particular it multiplies all the N output complex elements of radix stage X by a specific trigonometric complex constant before passing those values to the next stage Y.

 

This trigonometric complex constant, the "twiddle" factor, depends on nx and ky:
twiddle_factor_pre_stage.png

The twiddle factor multiplication is also highly parallelizable and can be implemented in OpenCL in the following manner:

 

#define M_2PI_F 6.283185482025146484375f

#define TWIDDLE_FACTOR_MULTIPLICATION(phi, input)                 \
{                                                                 \
    float2 w, tmp;                                                \
    w.x = cos(phi);                                               \
    w.y = sin(phi);                                               \
    tmp.x = (w.x * input.x) - (w.y * input.y);                    \
    tmp.y = (w.x * input.y) + (w.y * input.x);                    \
    input = tmp;                                                  \
}

/**
 * @brief This kernel computes the twiddle factor multiplication between 2 generic radix stages X and Y
 *
 * @param[in, out] input It contains the input and output complex values
 * @param[in]      Nx    It is the span
 * @param[in]      Ny    It is the radix order of stage Y
 * @param[in]      Ni    Nx * Ny
 */
kernel void twiddle_factor_multiplication(global float2* input, const uint Nx, const uint Ny, const uint Ni)
{
    /* Each work-item computes a single complex input */
    const uint n = get_global_id(0);

    /* Compute nx */
    uint nx = n % Nx;

    /* Compute k index */
    uint k = (n * Ny) % Ni + (n / Nx) % Ny + Ni * (n / Ni);

    /* Compute ky */
    uint ky = k % Ny;

    /* Compute angle of twiddle factor */
    float phi = (-M_2PI_F * nx * ky) / (float)Ni;

    /* Multiply by twiddle factor */
    TWIDDLE_FACTOR_MULTIPLICATION(phi, input[n]);
}

 

The above kernel uses the same cl_buffer for loading and storing.

 

An interesting optimization concerns the computation of ky. As we know,

ky_computation.png

so:

ky_expansion.png

since in the above expression A and B are multiples of Ny, the terms 1 and 3 will be 0, therefore:

ky_reduced.png

but for the identity property of the modulo operator we have:

modulo_identity.png

and then we can compute ky as:

ky_final.png
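Written out explicitly, the derivation sketched by the images above is:

k_y = k \bmod N_y = \big[\, (nN_y) \bmod N_i \;+\; (\lfloor n/N_x \rfloor \bmod N_y) \;+\; N_i \lfloor n/N_i \rfloor \,\big] \bmod N_y

and since (nN_y) \bmod N_i and N_i \lfloor n/N_i \rfloor are both multiples of N_y (recall N_i = N_x N_y), they vanish modulo N_y, leaving:

k_y = (\lfloor n/N_x \rfloor \bmod N_y) \bmod N_y = \lfloor n/N_x \rfloor \bmod N_y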

 

At this point we can simplify our OpenCL kernel in the following manner:

 

kernel void twiddle_factor_multiplication(global float2* input, const uint Nx, const uint Ny, const uint Ni)
{
    /* Each work-item computes a single complex input */
    const uint n = get_global_id(0);

    /* Compute nx */
    uint nx = n % Nx;

    /* Compute ky */
    uint ky = (n / Nx) % Ny;               // <-

    /* Compute angle of twiddle factor */
    float phi = (-M_2PI_F * nx * ky) / (float)Ni;

    /* Multiply by twiddle factor */
    TWIDDLE_FACTOR_MULTIPLICATION(phi, input[n]);
}

 

Another important and simple optimization that we can introduce concerns the computation of the angle for the twiddle factor multiplication. As we can observe, the term (-M_2PI_F / (float)Ni) is a constant for all work-items and can therefore be passed as an argument to the CL kernel.

 

/**
 * @brief This kernel computes the twiddle factor multiplication between 2 generic radix stages X and Y
 *
 * @param[in, out] input     It contains the input and output complex values
 * @param[in]      Nx        It is the span
 * @param[in]      Ny        It is the radix order of stage Y
 * @param[in]      Ni        Nx * Ny
 * @param[in]      exp_const (-M_2PI_F / (float)Ni)
 */
kernel void twiddle_factor_multiplication(global float2* input, const uint Nx, const uint Ny, const uint Ni, const float exp_const)
{
    /* Each work-item computes a single complex input */
    const uint n = get_global_id(0);

    /* Compute nx */
    uint nx = n % Nx;

    /* Compute ky */
    uint ky = (n / Nx) % Ny;

    /* Compute angle of twiddle factor */
    float phi = (float)(nx * ky) * exp_const;                    // <-

    /* Multiply by twiddle factor */
    TWIDDLE_FACTOR_MULTIPLICATION(phi, input[n]);
}

 

With these 2 simple optimizations, we are able to improve the performance of this kernel by a factor of ~1.6-1.7x as illustrated in the following graph.

 

twiddle_factor_optimized.png

 

Radix computation

 

The radix stage is the main computation block of the pipeline, composed of N / S radix-S basic elements.

Each radix-S consists of S inputs and S outputs and computes an optimized DFT of length S.

radix_s.png

Since there are no dependencies between the radix basic elements of the same stage, the radix stage is also an embarrassingly parallel problem.

 

7x5x3.png

 

In our implementation each radix-S basic element of a radix stage is performed by a single work-item. Since in the first stage the radix basic element takes its input in linear order, whilst in the following stages the input is addressed according to the span Nx, we are going to use 2 different kernels for the radix basic element:

  1. One kernel for the first stage which takes the input in linear order
  2. One kernel for the other radix stages

 

Both kernels use a single cl_buffer for loading and storing.
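Before looking at the individual radix kernels, the sketch below shows one possible way to chain the digit-reverse, twiddle factor and radix kernels from the host side. This is only a minimal illustration under several assumptions: the kernels have already been built and the buffers created and filled elsewhere, every stage is assumed to use the same radix order (a real implementation would select the appropriate radix kernel for each stage), and error checking is omitted.

#include <CL/cl.h>

/* Hypothetical helper (not from the actual implementation): enqueue the
 * whole mixed-radix pipeline for the simplified case in which every stage
 * uses the same radix order. */
static void enqueue_fft_pipeline(cl_command_queue queue,
                                 cl_kernel k_digit_reverse, /* digit_reverse                    */
                                 cl_kernel k_radix_first,   /* radix_S_first_stage              */
                                 cl_kernel k_radix,         /* radix_S (stages after the first) */
                                 cl_kernel k_twiddle,       /* twiddle_factor_multiplication    */
                                 cl_mem buf_in,             /* input complex values             */
                                 cl_mem buf_work,           /* digit-reversed, in-place buffer  */
                                 cl_mem buf_idx,            /* digit-reverse indices            */
                                 const cl_uint *radix,      /* radix order of each stage        */
                                 cl_uint n_stages,
                                 cl_uint N)
{
    size_t gws;

    /* Digit-reverse: one work-item per complex value */
    clSetKernelArg(k_digit_reverse, 0, sizeof(cl_mem), &buf_in);
    clSetKernelArg(k_digit_reverse, 1, sizeof(cl_mem), &buf_work);
    clSetKernelArg(k_digit_reverse, 2, sizeof(cl_mem), &buf_idx);
    gws = N;
    clEnqueueNDRangeKernel(queue, k_digit_reverse, 1, NULL, &gws, NULL, 0, NULL, NULL);

    cl_uint Nx = 1;
    for (cl_uint s = 0; s < n_stages; ++s)
    {
        cl_uint Ny = radix[s];
        cl_uint Ni = Nx * Ny;

        if (s > 0)
        {
            /* Twiddle factor multiplication between stage s-1 and stage s:
             * one work-item per complex value */
            float exp_const = -6.283185482025146484375f / (float)Ni;
            clSetKernelArg(k_twiddle, 0, sizeof(cl_mem),  &buf_work);
            clSetKernelArg(k_twiddle, 1, sizeof(cl_uint), &Nx);
            clSetKernelArg(k_twiddle, 2, sizeof(cl_uint), &Ny);
            clSetKernelArg(k_twiddle, 3, sizeof(cl_uint), &Ni);
            clSetKernelArg(k_twiddle, 4, sizeof(float),   &exp_const);
            gws = N;
            clEnqueueNDRangeKernel(queue, k_twiddle, 1, NULL, &gws, NULL, 0, NULL, NULL);
        }

        /* Radix stage: one work-item per radix-Ny basic element */
        gws = N / Ny;
        if (s == 0)
        {
            clSetKernelArg(k_radix_first, 0, sizeof(cl_mem), &buf_work);
            clEnqueueNDRangeKernel(queue, k_radix_first, 1, NULL, &gws, NULL, 0, NULL, NULL);
        }
        else
        {
            clSetKernelArg(k_radix, 0, sizeof(cl_mem),  &buf_work);
            clSetKernelArg(k_radix, 1, sizeof(cl_uint), &Nx);
            clSetKernelArg(k_radix, 2, sizeof(cl_uint), &Ni);
            clEnqueueNDRangeKernel(queue, k_radix, 1, NULL, &gws, NULL, 0, NULL, NULL);
        }

        Nx *= Ny;
    }
}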

 

In the following we are going to describe the implementation of radix 2/3/4/5/7.

 

Radix-2

 

Defined as:

radix_2.png

can be expressed in the following matrix form:

matrix_form_radix_2.png
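For reference, the 2-point DFT matrix shown in the image above is simply:

\begin{bmatrix} X(0) \\ X(1) \end{bmatrix} =
\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
\begin{bmatrix} x(0) \\ x(1) \end{bmatrix}

which is exactly what the DFT_2 macro below computes: a sum and a difference.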

 

#define DFT_2(c0, c1) \
{                     \
    float2 v0;        \
    v0 = c0;          \
    c0 = v0 + c1;     \
    c1 = v0 - c1;     \
}

/**
 * @brief This kernel computes DFT of size 2 for the first stage
 *
 * @param[in, out] input It contains the input and output complex values
 */
kernel void radix_2_first_stage(global float* input)
{
    /* Each work-item computes a single radix-2 */
    uint idx = get_global_id(0) * 4;

    /* Load two complex input values */
    float4 in = vload4(0, input + idx);

    /* Compute DFT N = 2 */
    DFT_2(in.s01, in.s23);

    /* Store two complex output values */
    vstore4(in, 0, input + idx);
}

/**
 * @brief This kernel computes DFT of size 2 for the radix stages after the first
 *
 * @param[in, out] input It contains the input and output complex value
 * @param[in]      Nx    It is the span
 * @param[in]      Ni    Nx * Ny
 */
kernel void radix_2(global float2* input, uint Nx, uint Ni)
{
    /* Each work-item computes a single radix-2 */
    uint kx = get_global_id(0);

    /* Compute n index */
    uint n = (kx % Nx) + (kx / Nx) * Ni;

    /* Load two complex input values */
    float2 c0 = input[n];
    float2 c1 = input[n + Nx];

    /* Compute DFT N = 2 */
    DFT_2(c0, c1);

    /* Store two complex output values */
    input[n] = c0;
    input[n + Nx] = c1;
}

 

Radix-3

 

Defined as:

radix_3.png

can be expressed in the following matrix form:

matrix_form_radix_3.png

 

#define SQRT3DIV2       0.86602540378443f

#define DFT_3(c0, c1, c2)                          \
{                                                  \
    float2 v0 = c1 + c2;                           \
    float2 v1 = c1 - c2;                           \
    c1.x = c0.x - 0.5f * v0.x + v1.y * SQRT3DIV2;  \
    c1.y = c0.y - 0.5f * v0.y - v1.x * SQRT3DIV2;  \
    c2.x = c0.x - 0.5f * v0.x - v1.y * SQRT3DIV2;  \
    c2.y = c0.y - 0.5f * v0.y + v1.x * SQRT3DIV2;  \
    c0 = c0 + v0;                                  \
}

/**
 * @brief This kernel computes DFT of size 3 for the first stage
 *
 * @param[in, out] input It contains the input and output complex values
 */
kernel void radix_3_first_stage(global float* input)
{
    /* Each work-item computes a single radix-3 */
    uint idx = get_global_id(0) * 6;

    /* Load three complex input values */
    float4 in0 = vload4(0, input + idx);
    float2 in1 = vload2(0, input + idx + 4);

    /* Compute DFT N = 3 */
    DFT_3(in0.s01, in0.s23, in1.s01);

    /* Store three complex output values */
    vstore4(in0, 0, input + idx);
    vstore2(in1, 0, input + idx + 4);
}

/**
 * @brief This kernel computes DFT of size 3 for the radix stages after the first
 *
 * @param[in, out] input It contains the input and output complex value
 * @param[in]      Nx    It is the span
 * @param[in]      Ni    Nx * Ny
 */
kernel void radix_3(global float2* input, uint Nx, uint Ni)
{
    /* Each work-item computes a single radix-3 */
    uint kx = get_global_id(0);

    /* Compute n index */
    uint n = (kx % Nx) + (kx / Nx) * Ni;

    /* Load three complex input values */
    float2 c0 = input[n];
    float2 c1 = input[n + Nx];
    float2 c2 = input[n + 2 * Nx];

    /* Compute DFT N = 3 */
    DFT_3(c0, c1, c2);

    /* Store three complex output values */
    input[n] = c0;
    input[n + Nx] = c1;
    input[n + 2 * Nx] = c2;
}

 

Radix-4

 

Defined as:

radix_4.png

can be expressed in the following matrix form:

matrix_form_radix_4.png
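For reference, the 4-point DFT matrix shown above is (with W_4 = e^{-j\pi/2} = -j):

\begin{bmatrix} X(0) \\ X(1) \\ X(2) \\ X(3) \end{bmatrix} =
\begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & -j & -1 & j \\
1 & -1 & 1 & -1 \\
1 & j & -1 & -j
\end{bmatrix}
\begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \end{bmatrix}

so no true multiplications are needed: the DFT_4 macro below implements the ±1 and ±j combinations using only additions, subtractions and component swaps, which is part of what makes a dedicated radix-4 so cheap.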

 

#define DFT_4(c0, c1, c2, c3) \
{                             \
    float2 v0, v1, v2, v3;    \
    v0 = c0 + c2;             \
    v1 = c1 + c3;             \
    v2 = c0 - c2;             \
    v3.x = c1.y - c3.y;       \
    v3.y = c3.x - c1.x;       \
    c0 = v0 + v1;             \
    c2 = v0 - v1;             \
    c1 = v2 + v3;             \
    c3 = v2 - v3;             \
}

/**
 * @brief This kernel computes DFT of size 4 for the first stage
 *
 * @param[in, out] input It contains the input and output complex values
 */
kernel void radix_4_first_stage(global float* input)
{
    /* Each work-item computes a single radix-4 */
    uint idx = get_global_id(0) * 8;

    /* Load four complex input values */
    float8 in = vload8(0, input + idx);

    /* Compute DFT N = 4 */
    DFT_4(in.s01, in.s23, in.s45, in.s67);

    /* Store four complex output values */
    vstore8(in, 0, input + idx);
}

/**
 * @brief This kernel computes DFT of size 4 for the radix stages after the first
 *
 * @param[in, out] input It contains the input and output complex value
 * @param[in]      Nx    It is the span
 * @param[in]      Ni    Nx * Ny
 */
kernel void radix_4(global float2* input, uint Nx, uint Ni)
{
    /* Each work-item computes a single radix-4 */
    uint kx = get_global_id(0);

    /* Compute n index */
    uint n = (kx % Nx) + (kx / Nx) * Ni;

    /* Load four complex input values */
    float2 c0 = input[n];
    float2 c1 = input[n + Nx];
    float2 c2 = input[n + 2 * Nx];
    float2 c3 = input[n + 3 * Nx];

    /* Compute DFT N = 4 */
    DFT_4(c0, c1, c2, c3);

    /* Store four complex output values */
    input[n] = c0;
    input[n + Nx] = c1;
    input[n + 2 * Nx] = c2;
    input[n + 3 * Nx] = c3;
}

 

 

Radix-5

 

Defined as:

radix_5.png

can be expressed in the following matrix form:

matrix_form_radix_5.png

 

#define W5_A    0.30901699437494f
#define W5_B    0.95105651629515f
#define W5_C    0.80901699437494f
#define W5_D    0.58778525229247f

#define DFT_5(c0, c1, c2, c3, c4)               \
{                                               \
    float2 v0, v1, v2, v3, v4;                  \
    v0 = c0;                                    \
    v1 = W5_A * (c1 + c4) - W5_C * (c2 + c3);   \
    v2 = W5_C * (c1 + c4) - W5_A * (c2 + c3);   \
    v3 = W5_D * (c1 - c4) - W5_B * (c2 - c3);   \
    v4 = W5_B * (c1 - c4) + W5_D * (c2 - c3);   \
    c0 = v0 + c1 + c2 + c3 + c4;                \
    c1 = v0 + v1 + (float2)(v4.y, -v4.x);       \
    c2 = v0 - v2 + (float2)(v3.y, -v3.x);       \
    c3 = v0 - v2 + (float2)(-v3.y, v3.x);       \
    c4 = v0 + v1 + (float2)(-v4.y, v4.x);       \
}

/**
 * @brief This kernel computes DFT of size 5 for the first stage
 *
 * @param[in, out] input It contains the input and output complex values
 */
kernel void radix_5_first_stage(global float* input)
{
    /* Each work-item computes a single radix-5 */
    uint idx = get_global_id(0) * 10;

    /* Load five complex input values */
    float8 in0 = vload8(0, input + idx);
    float2 in1 = vload2(0, input + idx + 8);

    /* Compute DFT N = 5 */
    DFT_5(in0.s01, in0.s23, in0.s45, in0.s67, in1.s01);

    /* Store five complex output values */
    vstore8(in0, 0, input + idx);
    vstore2(in1, 0, input + idx + 8);
}

/**
 * @brief This kernel computes DFT of size 5 for the radix stages after the first
 *
 * @param[in, out] input It contains the input and output complex value
 * @param[in]      Nx    It is the span
 * @param[in]      Ni    Nx * Ny
 */
kernel void radix_5(global float2* input, uint Nx, uint Ni)
{
    /* Each work-item computes a single radix-5 */
    uint kx = get_global_id(0);

    /* Compute n index */
    uint n = (kx % Nx) + (kx / Nx) * Ni;

    /* Load five complex input values */
    float2 c0 = input[n];
    float2 c1 = input[n + Nx];
    float2 c2 = input[n + 2 * Nx];
    float2 c3 = input[n + 3 * Nx];
    float2 c4 = input[n + 4 * Nx];

    /* Compute DFT N = 5 */
    DFT_5(c0, c1, c2, c3, c4);

    /* Store five complex output values */
    input[n] = c0;
    input[n + Nx] = c1;
    input[n + 2 * Nx] = c2;
    input[n + 3 * Nx] = c3;
    input[n + 4 * Nx] = c4;
}

 

Radix-7

 

Defined as:

radix_7.png

can be expressed in the following matrix form:

matrix_form_radix_7.png

 

#define W7_A    0.62348980185873f
#define W7_B    0.78183148246802f
#define W7_C    0.22252093395631f
#define W7_D    0.97492791218182f
#define W7_E    0.90096886790241f
#define W7_F    0.43388373911755f

#define DFT_7(c0, c1, c2, c3, c4, c5, c6)                           \
{                                                                   \
    float2 v0, v1, v2, v3, v4, v5, v6;                              \
    v0 = c0;                                                        \
    v1 = W7_A * (c1 + c6) - W7_C * (c2 + c5) - W7_E * (c3 + c4);    \
    v2 = W7_C * (c1 + c6) + W7_E * (c2 + c5) - W7_A * (c3 + c4);    \
    v3 = W7_E * (c1 + c6) - W7_A * (c2 + c5) + W7_C * (c3 + c4);    \
    v4 = W7_B * (c1 - c6) + W7_D * (c2 - c5) + W7_F * (c3 - c4);    \
    v5 = W7_D * (c1 - c6) - W7_F * (c2 - c5) - W7_B * (c3 - c4);    \
    v6 = W7_F * (c1 - c6) - W7_B * (c2 - c5) + W7_D * (c3 - c4);    \
    c0 = v0 + c1 + c2 + c3 + c4 + c5 + c6;                          \
    c1 = v0 + v1 + (float2)(v4.y, -v4.x);                           \
    c2 = v0 - v2 + (float2)(v5.y, -v5.x);                           \
    c3 = v0 - v3 + (float2)(v6.y, -v6.x);                           \
    c4 = v0 - v3 + (float2)(-v6.y, v6.x);                           \
    c5 = v0 - v2 + (float2)(-v5.y, v5.x);                           \
    c6 = v0 + v1 + (float2)(-v4.y, v4.x);                           \
}

/**
 * @brief This kernel computes DFT of size 7 for the first stage
 *
 * @param[in, out] input It contains the input and output complex values
 */
kernel void radix_7_first_stage(global float* input)
{
    /* Each work-item computes a single radix-7 */
    uint idx = get_global_id(0) * 14;

    /* Load seven complex input values */
    float8 in0 = vload8(0, input + idx);
    float4 in1 = vload4(0, input + idx + 8);
    float2 in2 = vload2(0, input + idx + 12);

    /* Compute DFT N = 7 */
    DFT_7(in0.s01, in0.s23, in0.s45, in0.s67, in1.s01, in1.s23, in2.s01);

    /* Store seven complex output values */
    vstore8(in0, 0, input + idx);
    vstore4(in1, 0, input + idx + 8);
    vstore2(in2, 0, input + idx + 12);
}

/**
 * @brief This kernel computes DFT of size 7 for the radix stages after the first
 *
 * @param[in, out] input It contains the input and output complex value
 * @param[in]      Nx    It is the span
 * @param[in]      Ni    Nx * Ny
 */
kernel void radix_7(global float2* input, uint Nx, uint Ni)
{
    /* Each work-item computes a single radix-7 */
    uint kx = get_global_id(0);

    /* Compute n index */
    uint n = (kx % Nx) + (kx / Nx) * Ni;

    /* Load seven complex input values */
    float2 c0 = input[n];
    float2 c1 = input[n + Nx];
    float2 c2 = input[n + 2 * Nx];
    float2 c3 = input[n + 3 * Nx];
    float2 c4 = input[n + 4 * Nx];
    float2 c5 = input[n + 5 * Nx];
    float2 c6 = input[n + 6 * Nx];

    /* Compute DFT N = 7 */
    DFT_7(c0, c1, c2, c3, c4, c5, c6);

    /* Store seven complex output values */
    input[n] = c0;
    input[n + Nx] = c1;
    input[n + 2 * Nx] = c2;
    input[n + 3 * Nx] = c3;
    input[n + 4 * Nx] = c4;
    input[n + 5 * Nx] = c5;
    input[n + 6 * Nx] = c6;
}

 

 

Merging twiddle factor multiplication with radix stage

 

Generally there are a couple of important benefits to combining 2 OpenCL kernels into one:

  1. Fewer GPU jobs to dispatch, which reduces possible driver overhead. In our implementation this saves (log2(N) - 1) GPU jobs.
  2. Fewer memory accesses, which is a great thing not only for performance but also for power consumption.

Regarding the second point: since both the twiddle factor multiplication and the radix computation use the same cl_buffer for loading and storing, keeping them in 2 separate kernels means accessing the same memory locations twice.

 

Combining these 2 kernels, the new pipeline becomes:

pipeline_merged.png

 

In order to compute the twiddle factor multiplication inside the radix kernel, we need just a small tweak as we can see, for instance, in the following radix-5 OpenCL kernel:

 

/**
 * @brief This kernel computes DFT of size 5 for the radix stages after the first
 *
 * @param[in, out] input     It contains the input and output complex value
 * @param[in]      Nx        It is the span
 * @param[in]      Ni        Nx * Ny
 * @param[in]      exp_const (-M_2PI_F / (float)Ni)
 */
kernel void radix_5(global float2* input, uint Nx, uint Ni, float exp_const)
{
    /* Each work-item computes a single radix-5 */
    uint kx = get_global_id(0);

    /* Compute nx */
    uint nx = kx % Nx;                                   // <-

    /* Compute n index */
    uint n = nx + (kx / Nx) * Ni;                        // <-

    /* Load five complex input values */
    float2 c0 = input[n];
    float2 c1 = input[n + Nx];
    float2 c2 = input[n + 2 * Nx];
    float2 c3 = input[n + 3 * Nx];
    float2 c4 = input[n + 4 * Nx];

    /* Compute phi */
    float phi = (float)nx * exp_const;                   // <- Please note: there is no ky

    /* Multiply by twiddle factor */
    TWIDDLE_FACTOR_MULTIPLICATION(phi, c1);
    TWIDDLE_FACTOR_MULTIPLICATION(2 * phi, c2);
    TWIDDLE_FACTOR_MULTIPLICATION(3 * phi, c3);
    TWIDDLE_FACTOR_MULTIPLICATION(4 * phi, c4);

    /* Compute DFT N = 5 */
    DFT_5(c0, c1, c2, c3, c4);

    /* Store five complex output values */
    input[n] = c0;
    input[n + Nx] = c1;
    input[n + 2 * Nx] = c2;
    input[n + 3 * Nx] = c3;
    input[n + 4 * Nx] = c4;
}
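 

For reference, the TWIDDLE_FACTOR_MULTIPLICATION macro used above was introduced in the first part of this series: it is essentially a complex multiplication by e^(i*phi). A minimal sketch along those lines (the actual macro may differ in detail, for instance by using the faster native_cos/native_sin built-ins) looks like this:

#define TWIDDLE_FACTOR_MULTIPLICATION(phi, input)       \
{                                                       \
    float2 w, tmp;                                      \
    w.x = cos(phi);                                     \
    w.y = sin(phi);                                     \
    /* Complex multiplication: input = w * input */     \
    tmp.x = (w.x * input.x) - (w.y * input.y);          \
    tmp.y = (w.x * input.y) + (w.y * input.x);          \
    input = tmp;                                        \
}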

 

The following graph shows that the speed-up is already considerable (~1.5x) for small N.

 

separated_vs_merged.png

 

Mixed-Radix vs Radix-2

 

We are approaching the end of this second part but before we finish, we'd like to present a final comparison between mixed-radix and radix-2.

In the first article we introduced mixed-radix as the solution for efficiently overcoming the problem of N not being a power of 2. However, if N is a power of 2, what is the benefit of using mixed-radix rather than radix-2?

As we know, radix-2 is just a special case of mixed-radix. Since our implementation also provides radix-4, two consecutive radix-2 stages can be merged into a single radix-4 stage, thus reducing:

  1. The number of radix stages to compute
  2. The number of memory accesses

The following graph shows the speed-up achievable by mixed-radix with radix-4 against a pure radix-2 implementation. As we can appreciate, the speed-up is already considerable for small N, reaching 1.7x better performance for N greater than 4096.

radix2_mixed.png
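 

To make the saving in radix stages concrete, here is a small illustrative sketch (not part of the implementation above) that counts the number of radix stages, and hence full passes over the data, for a given N with and without radix-4. For N = 4096 it returns 12 stages for radix-2 only and 6 stages when radix-4 is used:

/* Illustrative only: counts radix stages for a power-of-2 N */
unsigned int count_stages(unsigned int N, int use_radix_4)
{
    unsigned int stages = 0;

    while(N > 1)
    {
        if(use_radix_4 && (N % 4) == 0)
        {
            N /= 4;     /* one radix-4 stage replaces two radix-2 stages */
        }
        else if((N % 2) == 0)
        {
            N /= 2;     /* fall back to a radix-2 stage */
        }
        else
        {
            break;      /* N is not a power of 2; other radices would be needed */
        }
        ++stages;       /* each stage is one full pass over the buffer */
    }
    return stages;
}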

 

Summary

 

In this second article we presented the implementation of the 3 main FFT mixed-radix computation blocks by means of OpenCL. Regarding the twiddle factor multiplications, we showed how simple changes in the CL kernels can significantly speed up the computation.

We also saw that, although the nature of the pipeline is sequential, each stage represents an embarrassingly parallel problem that can be easily and efficiently implemented with OpenCL.

Finally, with the comparison between mixed-radix and radix-2, we saw that mixed-radix not only provides valuable flexibility in the choice of N, but also delivers a considerable speed-up for power-of-2 sizes when radix-4 is available.

 

In the next and final article of this blog series we are going to learn how to use the mixed-radix FFT we have implemented as a building block for computing the FFT in 2 dimensions, which is widely used in image processing, computer vision and machine learning applications.

 

Ciao,

Gian Marco

 

 

Back to Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 1

Read Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 3

ARM Vulkan: Built for Modern Mobile Systems



Built for Modern Mobile Systems

 

ARM, along with other key industry members, has made a lot of progress since the Vulkan announcement by the Khronos Group back in early 2015, and we’re now in the final stage of development.

 

What is Vulkan?

Vulkan is a new, open standard, cross platform API designed from the ground up to enable applications to run more efficiently on modern CPU and GPU architectures spanning desktop, console and most importantly, mobile.

 

Who’s going to use Vulkan?

Every application on a mobile device uses graphics in some way, for UI if nothing else. This includes the operating system, with Google in particular having made very clear statements of Vulkan support for Android, suggesting it will very likely become the core graphics API for that platform. Vulkan has received an unprecedented amount of industry support from OS vendors, silicon vendors, game engine developers and software developers.

 

lab.jpg

Higher performance at lower power

On mobile platforms Vulkan is a game changer, offering lower overhead, more direct control over the GPU and lower CPU usage. The end-user benefit is far more than bigger and better game graphics. Vulkan is a highly efficient API enabling developers to optimize their applications for significantly lower energy consumption than was previously possible.

 

Lower CPU usage and minimum energy consumption
Multi-thread / multicore friendly

Traditional graphics APIs were not designed for multi-threaded use and required a lot of synchronization with the CPU to manage draw calls, resulting in high CPU overhead and the CPU becoming a bottleneck, especially on mobile devices. Vulkan was designed from the ground up to minimize driver overhead and enable CPU multi-threading, with greater control over memory management for high-efficiency graphics and compute performance.

  vulkan blog 2.png

Direct GPU control with lower driver overhead

Applications using traditional graphics APIs can easily max out CPU usage during complex drawing, driving up energy consumption and generating very high device heat. GPU drivers needed to do a lot of work on the CPU to manage multiple rendering and compute processes. Vulkan provides developers with much greater control over GPU processing, enabling them to more easily balance graphics and compute loads across multiple cores. This results in low per-core usage, lower overall power consumption, less device heat and considerably longer battery life.

  Vulkan blog 3.png

With great power comes great responsibility

Vulkan has been designed with modern mobile GPU rendering and compute technologies in mind. Application developers can now manage how things are ordered and executed on their own rather than leaving it up to a generic driver. This gives them more work to do but also provides more flexibility in designing applications with better performance and lower energy consumption than was possible using OpenGL ES.

 

What about ARM?

Quote from the Vulkan March 2015 Khronos announcement …

“Since helping found Khronos, ARM has strived to improve the efficiency of standards and deliver compelling graphics to mobile devices with minimum energy consumption,” said Jem Davies, vice president of technology, media processing group, ARM. “Vulkan is a big step forward in enabling our ecosystem of developers to unleash the capabilities of the latest ARM GPU technology.”

ARM, along with other key industry players, has been very busy refining the Vulkan API and ensuring it delivers on modern mobile needs. We will offer Vulkan support* on several GPUs shipping today and this support will be a key deliverable for many future Mali GPUs.

 

*Based on an internal draft Khronos specification of Vulkan, which may change prior to final release. Conformance criteria for this specification have not yet been established.


Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 3


Back to Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 1

Back to Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 2

 

In this third and last part of this blog series we are going to extend the mixed-radix FFT OpenCL™ implementation to two dimensions and explain how to optimise it to achieve good performance on Mobile ARM® Mali™ GPUs. After a short introduction explaining why the two-dimensional Fourier transform is an important and powerful tool, the first section covers the mathematical background that lets us reuse the existing one-dimensional approach to build the two-dimensional implementation. Afterwards, each step is described individually and the details are explained. Finally, at the end of each section we show the performance improvements achieved.

 

So far only the one-dimensional Fourier transform has been considered to explain the mathematical background and basic optimisations. However, in the field of image processing the two-dimensional variant is much more popular. For instance, a simple high-pass filter can be realised in the frequency domain after the image has been transformed by a two-dimensional Fourier transform. The resulting image could be used to implement an edge detection algorithm.

 

firefly.png  firefly_spec.png  firefly_spec2.png  firefly_edge.png
Original image (spatial domain) | Original image (frequency domain) | Removed high frequencies | Filtered image (spatial domain)
High-pass filter via Fourier transform

 

In general, the Fourier transform can be used to make the convolution operations used in many image processing applications more efficient. The improvement in the number of operations that have to be executed is significant, particularly when the kernel size is large. In the spatial domain, a two-dimensional convolution requires, for every element of the image, a number of multiplications and additions proportional to the size of the convolution kernel, so the resulting complexity is O(MNmn). In contrast, in the frequency domain the convolution corresponds to a pointwise multiplication of the transformed kernel and image, which reduces the complexity of the convolution itself to O(MN). Nevertheless, since the Fourier transform has to be computed in addition, it is not always beneficial to perform the convolution in the frequency domain.
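
To give a rough, illustrative sense of scale: for a 512 x 512 image (M = N = 512) and a 15 x 15 kernel (m = n = 15), spatial convolution needs in the order of 512 * 512 * 15 * 15 ≈ 59 million multiply-accumulate operations, whereas the pointwise multiplication in the frequency domain needs only 512 * 512 ≈ 0.26 million complex multiplications, plus the cost of the forward and inverse transforms.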

 

Background

Due to its separability the two-dimensional Fourier transformation can be expressed as two consecutive one-dimensional transformations. The mathematical principles are no different from those explained in the first part of this blog series which means that this background section will be rather short. It just shows briefly how the general form of the two-dimensional discrete Fourier transform can be split into its one-dimensional parts.

 

The two dimensional DFT can be expressed in the form

fft_derivation_1.png

where the trigonometric constant coefficients are defined as

fft_derivation_4.png

The coordinates k and l of the frequency domain are in the range

fft_derivation_2.png

 

Since each of the trigonometric constant coefficients depends only on one of the summation indices the inner sum can be extracted and defined as

fft_derivation_3.png

which is equal to the definition which was previously used to express the one-dimensional DFT. Only the parameter n had to be added to select the correct value in the two-dimensional input space.

 

If the newly defined one-dimensional transform is inserted into the original expression for the two-dimensional DFT it is easy to see from the following equation that just another one-dimensional transform is calculated

fft_derivation_5.png

Because of the commutativity of the operations either of the sums can be extracted. Therefore it does not matter over which dimension the first one-dimensional transform is calculated.
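
For completeness, the textbook form of the expressions referenced above can be written as follows (the figures in the original post are assumed to show the equivalent equations):

X(k, l) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x(m, n) \, e^{-j 2\pi \left( \frac{k m}{M} + \frac{l n}{N} \right)}, \qquad 0 \le k < M, \quad 0 \le l < N

\tilde{x}(k, n) = \sum_{m=0}^{M-1} x(m, n) \, e^{-j 2\pi \frac{k m}{M}} \qquad \Longrightarrow \qquad X(k, l) = \sum_{n=0}^{N-1} \tilde{x}(k, n) \, e^{-j 2\pi \frac{l n}{N}}

The inner transform is exactly the one-dimensional DFT used so far, computed once per value of n, and the outer sum is a second one-dimensional DFT along the other dimension.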

 

Implementations

In the background section the separability of the two-dimensional Fourier transform has been used to represent the two-dimensional transform as two consecutive one-dimensional transforms, one along each dimension. The row-column algorithm is based on this mathematical fact and is used for this implementation. In this algorithm a one-dimensional FFT is first computed individually for every row of the two-dimensional input data; afterwards a second one-dimensional FFT is performed for every column. However, instead of having two different functions to compute the FFT for the rows and for the columns, the same function is used and the input data is transposed between and after the two one-dimensional FFTs. This makes sense because the data layout is assumed to be row-major, so the most efficient way in terms of cache locality is to load data along the rows.

 

2d_fft_0.png  2d_fft_1.png  2d_fft_2.png  2d_fft_3.png  2d_fft_4.png
DFT2D_output_0.png  DFT2D_output_1.png  DFT2D_output_2.png  DFT2D_output_3.png  DFT2D_output_4.png
Original image | Separate 1D FFTs along the rows (first dimension) | Transpose intermediate image | Separate 1D FFTs along the rows (second dimension) | Transpose final image
Schematics and example of the row-column algorithm

 

Naive row-column algorithm

The naive implementation directly realises the principle of the row-column algorithm and the execution of the 2D FFT is therefore straightforward. The following diagram provides a first overview over the different stages of the 2D FFT and the pseudo code demonstrates the idea of a naive implementation.

 

overview.png

 

Pseudo code

// Run the 1D FFT along the rows of the input buffer
// The result is stored to a temporary buffer
fft_mixed_radix_2d_run_1d_part(input, tmp);


// Transpose the intermediate buffer in-place
transpose_complex_list(tmp);

// Run the 1D FFT along the rows of the (transposed) intermediate buffer
// Corresponds to the columns of the original buffer
// The result is stored to the output buffer
fft_mixed_radix_2d_run_1d_part(tmp, output);

// Transpose the output buffer in-place
transpose_complex_list(output);

 

The function to compute the 1D FFT is conceptually the same as the one used in the previous parts of this blog series, except that it has to loop over all rows of the source image. As a reminder: the chosen variant of the mixed-radix FFT requires the input values to be in digit-reversed order in order to compute the output values in linear order. Therefore the input values have to be shuffled before the actual FFT can be computed. The reordering is repeated for all rows of the matrix. Similarly, the computation of the 1D FFT is repeated for all rows independently of each other.

 

/**
* @brief Computes a 1D FFT for each row of the input
*
* @param[in]  input  The input matrix
* @param[out] output The output matrix
*/
fft_mixed_radix_2d_run_1d_part(input, output)
{
    // Iterate over all rows of the input matrix
    for(row_in in input and row_out in output)
    {
        // Perform digit-reverse reordering
        rev_row = digit_reverse(row_in);

        // Calculate mixed radix 1D FFT using the reversed row
        row_out = ...
    }
}

 

Naive implementation based on OpenCL mixed-radix kernels

This optimisation does not introduce anything terribly new compared to the optimisation of the one-dimensional Fourier transform. The first speedup over the CPU implementation can be achieved by running the computation of the actual FFT on the GPU. For this purpose the buffer with the input values has to be mapped to a cl_buffer. Afterwards the reordering to digit-reversed order is performed on the GPU. The rows of the matrix can be reordered independently, so a one-dimensional kernel is enqueued for each one. The reordering is followed by the first one-dimensional part of the FFT, which is now also computed on the GPU. Once the computation has finished, the buffers are mapped again to perform the matrix transpose between the two stages of the FFT on the CPU and out-of-place. Subsequently, the rows are reordered again and the second one-dimensional FFT is computed. Finally, the results of the second stage are transposed once more to restore the original data layout and copied over to the output buffer.

 

overview2.png

 

A loop is used to enqueue separate and independent one-dimensional kernels for each row of the input data. The kernels take the offset of the row within the linearised data representation as argument buf_offset_float.

Because each complex element of the buffer is represented through two floating point numbers (real and imaginary part) the row offset has to be doubled to get the actual offset within the buffer.

 

// Digit reverse stage
// Uses a 1D NDRange within the loop to process all rows of the buffer
const size_t digit_rev_gws[] = {N};

for(uint row = 0; row < N; ++row)
{
    uint buf_offset_float = row * N * 2;

    clSetKernelArg(digit_reverse_kernel, 0, sizeof(cl_mem), &input_buf);
    clSetKernelArg(digit_reverse_kernel, 1, sizeof(cl_mem), &digit_reversed_buf);
    clSetKernelArg(digit_reverse_kernel, 2, sizeof(cl_mem), &idx_bit_reverse_buff);
    clSetKernelArg(digit_reverse_kernel, 3, sizeof(cl_uint), &buf_offset_float);
    clEnqueueNDRangeKernel(queue, digit_reverse_kernel, 1, NULL, digit_rev_gws, NULL, 0, NULL, NULL);
}

// Perform 2D FFT
// Uses 1D NDRange within the loop to process all rows of the buffer
for(uint row = 0; row < N; ++row)
{
    // Offset of the current row within the buffer (two floats per complex value)
    uint buf_offset_float = row * N * 2;

    uint Nx = 1;
    uint Ny = radix_stage[0];

    // First stage
    const size_t first_stage_gws[] = {N / Ny};
    clSetKernelArg(first_stage_radix_kernel, 0, sizeof(cl_mem), &digit_reversed_buf);
    clSetKernelArg(first_stage_radix_kernel, 1, sizeof(cl_uint), &buf_offset_float);
    clEnqueueNDRangeKernel(queue, first_stage_radix_kernel, 1, NULL, first_stage_gws, NULL, 0, NULL, NULL);

    // Update Nx
    Nx *= Ny;

    // Following stages
    for(uint s = 1; s < n_stages; ++s)
    {
        // ...
    }
}

 

/**
 * @brief Computes the first stage of a radix-2 DFT.
 *
 * @param[in, out] input  The complex array.
 * @param[in]      offset The offset of the current row in the array.
 */
kernel void radix_2_first_stage(global float* input, const uint offset)
{
    // Each work-item computes a single radix-2
    uint idx = get_global_id(0) * 4;

    // Load two complex input values
    // The offset already accounts for the two floats per complex number
    float4 in = vload4(0, input + offset + idx);

    // Compute DFT N = 2
    DFT_2(in.s01, in.s23);

    // Store two complex output values
    vstore4(in, 0, input + offset + idx);
}

 

Replacing loops with two-dimensional kernels

In the first, naive GPU implementation separate one-dimensional kernels are enqueued for every row of the matrix. The performance can be significantly increased by using kernels with a two-dimensional NDRange instead of loops. The second dimension is then used as a replacement for the row offset within the linearised data structure which was previously incremented in the loop. In addition to the offset in the second dimension the offset in the buffer depends also on the size of the matrix which is passed as parameter N. Again the offset needs to be multiplied by two to account for the two part complex numbers.

 

overview3.png

 

First of all this change reduces the number of kernels and therefore the overhead of enqueuing and dispatching them. Moreover, it increases the number of work items which can be executed in parallel, since in principle work items from all rows can now run concurrently. Before, the kernels had to be executed one after the other and thus there was only a comparably small number of work items at any point in time, which reduced the GPU's utilisation, particularly for small matrices.

 

// Digit reverse stage
// Uses a 2D NDRange to process all rows of the buffer
const size_t digit_rev_gws[] = {N, N};
clSetKernelArg(digit_reverse_kernel, 0, sizeof(cl_mem), &input_buf);
clSetKernelArg(digit_reverse_kernel, 1, sizeof(cl_mem), &digit_reversed_buf);
clSetKernelArg(digit_reverse_kernel, 2, sizeof(cl_mem), &idx_bit_reverse_buf);
clSetKernelArg(digit_reverse_kernel, 3, sizeof(cl_uint), &N);
clEnqueueNDRangeKernel(queue, digit_reverse_kernel, 2, NULL, digit_rev_gws, NULL, 0, NULL, NULL);

uint buf_stride_float = N * 2;

// First stage
// Uses a 2D NDRange to process all rows of the buffer
uint Nx = 1;
uint Ny = radix_stage[0];

const size_t first_stage_gws[] = {N / Ny, N};
clSetKernelArg(first_stage_radix_kernel, 0, sizeof(cl_mem), &digit_reversed_buf);
clSetKernelArg(first_stage_radix_kernel, 1, sizeof(cl_uint), &buf_stride_float);
clEnqueueNDRangeKernel(queue, first_stage_radix_kernel, 2, NULL, first_stage_gws, NULL, 0, NULL, NULL);

// Update Nx
Nx *= Ny;

// Following stages
for(uint s = 1; s < n_stages; ++s)
{
    // ...
}

 

/**
 * @brief Computes the first stage of a radix-2 DFT.
 *
 * @param[in, out] input            The complex array.
 * @param[in]      buf_stride_float The number of floats per row in the array.
 */
kernel void radix_2_first_stage(global float* input, const uint buf_stride_float)
{
    // Each work-item computes a single radix-2
    uint idx = get_global_id(0) * 4;

    // Compute the offset into the buffer for the current row
    // Needs to be multiplied by two because complex numbers
    // are represented through two float values
    const uint offset = get_global_id(1) * buf_stride_float; // <-- Previously computed in the host code

    // Load two complex input values
    float4 in = vload4(0, input + offset + idx);

    // Compute DFT N = 2
    DFT_2(in.s01, in.s23);

    // Store two complex output values
    vstore4(in, 0, input + offset + idx);
}

 

Parallel transpose on the GPU

The next optimisation moves the transpose operation so that it also runs on the GPU. If it is not performed in-place there are no dependencies between the individual elements at all and the transpose can be fully executed in parallel. For in-place algorithms at least two elements have to be swapped in each step, as otherwise information would be overwritten.

 

overview4.png

 

Parallel elementwise out-of-place transpose

This approach resembles the CPU variant of a naive elementwise transpose of a matrix. Each work item reads one element according to its global id from the source buffer and writes it to the corresponding transposed position in the output buffer. The global work group size is thus equal to the size of the matrix (N x N). The elements themselves consist of two floating point values which represent the real and imaginary parts of the complex values.

 

/**
 * @brief Transposes a quadratic matrix of complex numbers stored in row major order.
 *
 * @param[in]  input              The complex input array.
 * @param[out] output             The complex output array.
 * @param[in]  buf_stride_complex The number of complex numbers per row in the array.
 */
kernel void transpose(global float2* input, global float2* output, const uint buf_stride_complex)
{
    uint ix = get_global_id(0);
    uint iy = get_global_id(1);

    output[ix * buf_stride_complex + iy] = input[iy * buf_stride_complex + ix];
}
naive.png

 

Parallel blockwise out-of-place transpose

However, this simple approach is not efficient enough to compensate for the additional driver overhead for launching kernels when performing the transpose on the GPU. To increase the performance the transpose operation has to be executed blockwise instead of elementwise. Instead of reading and writing one pixel per work item a small sub-block of the complete matrix is loaded. It is then locally transposed and written back to the corresponding position in the output matrix.

 

Compared to the naive transpose approach this optimisation increases cache efficiency and resource utilisation, since each work item operates on more than one pixel. Further, it helps to increase the utilisation of the GPU by making the kernels larger and slightly more complex. As a result the number of work items decreases, which reduces the overhead of dispatching work items for execution. The choice of two-by-two sub-blocks turned out to be optimal, since the local transpose of larger sub-blocks requires more registers.

 

/**
 * @brief Transposes a quadratic matrix of complex numbers stored in row major order.
 *
 * @param[in]  input            The complex input array.
 * @param[out] output           The complex output array.
 * @param[in]  buf_stride_float The number of floats per row in the array.
 */
kernel void transpose2x2(global float* input, global float* output, const uint buf_stride_float)
{
    const uint ix = get_global_id(0);
    const uint iy = get_global_id(1);

    float4 tmp;
    float4 u0, u1;

    // Load one sub-block from two rows
    u0 = vload4(ix, input + (iy * 2) * buf_stride_float);
    u1 = vload4(ix, input + (iy * 2 + 1) * buf_stride_float);

    // Transpose the sub-block
    tmp = u0;
    u0 = (float4)(tmp.s01, u1.s01);
    u1 = (float4)(tmp.s23, u1.s23);

    // Write the sub-block to the transposed position
    vstore4(u0, iy, output + (ix * 2) * buf_stride_float);
    vstore4(u1, iy, output + (ix * 2 + 1) * buf_stride_float);
}
blockwise.png

 

Speedup

image001.png

 

Parallel blockwise in-place transpose

Due to its nature a transpose operation can be realised by always swapping two corresponding elements, or two corresponding sub-blocks respectively. This allows us to load, transpose and store two sub-blocks per work item in order to reduce the number of work items and to achieve a better balance between arithmetic and load/store operations. As a side effect the transpose algorithm can be changed from out-of-place to in-place, such that the second buffer is no longer needed.

 

As a consequence of the transpose-and-swap approach it would be sufficient to have one work item for each block in the upper triangular matrix. However, if a one-dimensional kernel is enqueued with the smallest possible number of work items the transformation of the linear index into the two-dimensional index is less efficient than simply using a two-dimensional kernel and aborting the work items below the diagonal. The diagonal itself requires special treatment because the diagonal sub-blocks do not need to be exchanged with a different sub-block but only have to be transposed locally.

 

/**
 * @brief Transposes a quadratic matrix of complex numbers stored in row major order.
 *
 * @param[in, out] input            The complex array.
 * @param[in]      buf_stride_float The number of floats per row in the array.
 */
kernel void transpose2x2(global float* input, const uint buf_stride_float)
{
    const uint ix = get_global_id(0);
    const uint iy = get_global_id(1);

    // Abort for sub-blocks below the diagonal
    if(ix < iy)
    {
        return;
    }

    float4 tmp;
    float4 v0, v1, u0, u1;

    // Load first sub-block
    u0 = vload4(ix, input + (iy * 2) * buf_stride_float);
    u1 = vload4(ix, input + (iy * 2 + 1) * buf_stride_float);

    // Transpose first sub-block
    tmp = u0;
    u0 = (float4)(tmp.s01, u1.s01);
    u1 = (float4)(tmp.s23, u1.s23);

    // Only process a second sub-block if the first one is not on the diagonal
    if(ix != iy)
    {
        // Load second sub-block
        v0 = vload4(iy, input + (ix * 2) * buf_stride_float);
        v1 = vload4(iy, input + (ix * 2 + 1) * buf_stride_float);

        // Transpose second sub-block
        tmp = v0;
        v0 = (float4)(tmp.s01, v1.s01);
        v1 = (float4)(tmp.s23, v1.s23);

        // Store second sub-block to transposed position
        vstore4(v0, ix, input + (iy * 2) * buf_stride_float);
        vstore4(v1, ix, input + (iy * 2 + 1) * buf_stride_float);
    }

    // Store first sub-block to transposed position
    vstore4(u0, iy, input + (ix * 2) * buf_stride_float);
    vstore4(u1, iy, input + (ix * 2 + 1) * buf_stride_float);
}
transpose.png

 

Speedup

image005.png

 

Setting local work group size

For the two-dimensional transpose kernel the automatically inferred values for the local work group size were not able to achieve good performance. It turned out to be better to set the local work group size manually to 2 x 4, or smaller if the buffer size (divided by two, because each work item handles a 2 x 2 sub-block) is not a multiple of 4. The best size for the local work groups always depends on the kernel, and finding a good size can be hard. However, getting the sizes right can help to achieve better caching behaviour and therefore improve runtime performance.

 

// Setting up constants
// For odd matrix sizes padding is necessary
// because the transpose kernel stores two complex values
const size_t padding = 2 * (N % 2);
const size_t buf_stride_float = N * 2 + padding;
const size_t buf_stride_half = buf_stride_float / 2;

clSetKernelArg(transpose_kernel, 0, sizeof(cl_mem), &fft_buf);
clSetKernelArg(transpose_kernel, 1, sizeof(cl_uint), &buf_stride_float);

// Set global work group size
const size_t transpose_gws[2] = {buf_stride_half, buf_stride_half};
size_t transpose_lws[2];
size_t *transpose_lws_ptr;

// Determine the largest possible local work group size
// It has to divide the global work group size
if(buf_stride_half % 4 == 0)
{
    transpose_lws[0] = 2;
    transpose_lws[1] = 4;
    transpose_lws_ptr = transpose_lws;
}
else if(buf_stride_half % 2 == 0)
{
    transpose_lws[0] = 1;
    transpose_lws[1] = 2;
    transpose_lws_ptr = transpose_lws;
}
else
{
    transpose_lws_ptr = NULL;
}

clEnqueueNDRangeKernel(queue, transpose_kernel, 2, NULL, transpose_gws, transpose_lws_ptr, 0, NULL, NULL);

 

Speedup

image007.png

 

Comparison to radix-2 implementation

The following diagram visualises the benefit, in terms of shorter runtime, of using the mixed-radix variant instead of a radix-2 only implementation. Note the logarithmic scaling of the y-axis. Another advantage of a mixed-radix approach over a radix-2 only FFT is the larger number of supported sizes. For instance, FFTs for matrix sizes 120, 1000 and 3000 cannot be computed with the radix-2 only approach and are therefore missing from the diagram.

image003.png

 

Conclusion

While this pushes the row-column algorithm to its limits, it might be possible to achieve better performance if the underlying FFT algorithm is changed. For instance, if the FFT is not computed in-place, the digit-reversal could be dropped and, moreover, the transpose could be merged into the mixed-radix FFT kernels. However, it is difficult to predict the benefits, if there are any.

Useful optimisation aspects

The following list summarises the things that have been considered to increase the performance of the 2D FFT mixed-radix algorithm on Mobile ARM® Mali™ GPUs in this article:

  • Instead of enqueuing many one-dimensional kernels in a loop it is better to use one/fewer higher-dimensional kernel/s.
  • The optimal size of the sub-blocks for the blocked transpose depends on the available resources. Increasing the block size has only positive effects as long as the cache is big enough and sufficient registers are available.
  • Padding buffers can help to generalise kernels and can prevent branches in the control flow. For instance, the mapped buffers are padded to the next larger even size in both dimensions to eliminate edge cases during the transpose operation.
  • Loads and stores are most efficient if they are performed according to the data layout, because only then is the cache used efficiently.
  • It can be more efficient to conditionally abort superfluous work items in a 2D range directly instead of computing the two-dimensional index out of a linear index.
  • Find ways to parallelise sequential code and keep work items as independent from each other as possible.
  • Compute more than one pixel (or more bytes) per work item.
  • Reduce the number of load/store operations by vectorising. This is especially beneficial because the GPU comprises a vector unit (SIMD).

 

Further reading

 

Back to Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 1

Back to Speeding-up Fast Fourier Transform Mixed-Radix on Mobile ARM Mali GPU by means of OpenCL - Part 2

Porting a Graphics Engine to the Vulkan API


The new Vulkan API from Khronos is the first to be designed to work across all platform types. Vulkan is a very low-level API that gives developers explicit control of *everything*.

 

In this blog we want to share with developers our experience of porting a graphics engine to the Vulkan API. The engine we decided to port to Vulkan already had support for various other standard APIs such as OpenGL 4.5, OpenGL ES 2.0 and OpenGL ES 3.2. Although these APIs are all different, they have some characteristics in common, for example that an application cannot efficiently talk to the driver from multiple threads. Implementing multi-threading on top of those previous APIs would not improve performance, as it would require a lot of context switches which would eat up most of the performance budget. The GL APIs were designed for previous generations of GPU hardware, and while the capabilities of hardware and technology have evolved, the APIs haven’t done so at the same rate. Vulkan therefore comes at just the right time to address most of the issues of the existing APIs. It also provides developers with a lot more control and flexibility over resources, and over how and when “jobs” are executed in the graphics pipeline. It does, however, mean that there isn’t the level of support developers might have come to expect when working with OpenGL ES, but there are Vulkan debug layers available to plug in that can sanity-check your usage and keep you on the right track.
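
As a minimal illustration of how those debug layers can be enabled (a sketch only, not our engine code; the layer name is the one shipped with the LunarG SDK at the time of writing and may differ between SDK versions), validation is simply requested when the Vulkan instance is created:

#include <vulkan/vulkan.h>

static VkInstance create_debug_instance(void)
{
    /* Request the standard validation layer for development builds */
    const char* layers[] = { "VK_LAYER_LUNARG_standard_validation" };

    VkApplicationInfo app_info = { 0 };
    app_info.sType      = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    app_info.apiVersion = VK_MAKE_VERSION(1, 0, 0);

    VkInstanceCreateInfo info = { 0 };
    info.sType               = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    info.pApplicationInfo    = &app_info;
    info.enabledLayerCount   = 1;
    info.ppEnabledLayerNames = layers;

    VkInstance instance = VK_NULL_HANDLE;
    if (vkCreateInstance(&info, NULL, &instance) != VK_SUCCESS)
    {
        /* In a real application: fall back to no layers or report the error */
        return VK_NULL_HANDLE;
    }
    return instance;
}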

 

Another really important advantage of Vulkan is memory management. You might have wondered how buffers and textures are handled by the GPU pipeline and at what points they are actually allocated and destroyed. If you assumed they are allocated and deallocated at the moment you call glTexImage2D, glBufferData or glDeleteTexture, you might be surprised. The behavior of memory management depends on the vendor, the GPU and the pipeline architecture. The desktop pipeline, which is typically immediate, is different from mobile (deferred). In OpenGL the memory management is hidden from the programmer and the driver decides when is the best time to allocate memory for a resource. While an immediate architecture may be simpler when it comes to handling memory, a deferred one is more complicated, as it requires retaining buffers across multiple frames.

 

 

If you want to learn more about deferred pipeline architecture please see my colleague’s blog:

https://community.arm.com/groups/arm-mali-graphics/blog/2014/02/03/the-mali-gpu-an-abstract-machine-part-1

 

For example, if you have a dynamic buffer that you allocate and deallocate within one frame, the driver would have to keep a reference to this data until it’s no longer used in the pipeline. For an immediate mode architecture this might be 1-2 frames, while a deferred mode architecture might need to keep track of this for 2-4 frames. Handling textures may require keeping a reference for even more frames, since the processing sits deeper in the pipeline. This requires maintaining more memory and complicates tracking the data. In the Vulkan API, all memory management is more explicit. The developer allocates and deallocates memory which needs to be bound to Vulkan objects in order to be useful. This also opens up the possibility to alias memory between, for instance, a texture and a buffer.

It’s also up to the developer to ensure that memory that is used by the pipeline is valid throughout execution – this means that you cannot free memory that is bound to the pipeline. Luckily Vulkan also provides a means to check when a given command buffer has been completed and decide whether to free or recycle memory.
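
A minimal sketch of this explicit model (not our engine's actual code) is shown below: a buffer is created, device memory is allocated for it by the application, the two are bound together and the memory is mapped for CPU access. Here 'device', 'physical_device' and 'find_memory_type' are assumed helpers/handles that a real application would already have:

VkBuffer       buffer = VK_NULL_HANDLE;
VkDeviceMemory memory = VK_NULL_HANDLE;

VkBufferCreateInfo buffer_info = { 0 };
buffer_info.sType       = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
buffer_info.size        = 64 * 1024;                      /* 64 KiB, arbitrary */
buffer_info.usage       = VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT;
buffer_info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;
vkCreateBuffer(device, &buffer_info, NULL, &buffer);

/* The driver no longer allocates for us: query what the buffer needs... */
VkMemoryRequirements reqs;
vkGetBufferMemoryRequirements(device, buffer, &reqs);

/* ...allocate a block of device memory ourselves... */
VkMemoryAllocateInfo alloc_info = { 0 };
alloc_info.sType           = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
alloc_info.allocationSize  = reqs.size;
alloc_info.memoryTypeIndex = find_memory_type(physical_device, reqs.memoryTypeBits,
                                              VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT); /* hypothetical helper */
vkAllocateMemory(device, &alloc_info, NULL, &memory);

/* ...and bind it to the buffer object. The same memory block could be
 * sub-allocated and bound to several resources (aliasing). */
vkBindBufferMemory(device, buffer, memory, 0);

/* CPU access is explicit too */
void* data = NULL;
vkMapMemory(device, memory, 0, VK_WHOLE_SIZE, 0, &data);
/* ...write data, then vkUnmapMemory(device, memory); */

/* The memory must stay alive until the GPU has finished using it; only then
 * is it safe to call vkFreeMemory(device, memory, NULL). */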

 

First attempt

Okay, let’s get to the nitty gritty of the porting experience and how we approached it. As mentioned earlier regarding multithreading issues, our engine was single threaded. Therefore, in the first approach we designed the Vulkan backend on the same model as our existing OpenGL backend. Effectively we just implemented another API without changing any of the existing structure of the engine. In order to achieve that we created a default command buffer to which we internally recorded commands, but still all on one thread. We also had to create several objects such as Pipelines and Pipeline layouts which were reused for subsequent similar drawcalls. On the other hand, we called vkQueueWaitIdle after every vkQueueSubmit, which caused stalls and compromised performance. That, however, wasn’t our priority, as the main goal was simply to get the engine backend up and running on Vulkan.

 

In the engine we had a custom shader language that we can translate to different standards: HLSL, GLSL and GLSL ES. Basically just yet another shader cross compiler. The Vulkan API comes with the new SPIR-V shader intermediate representation, so that was another significant chunk of work to be done. First we had to cross compile using our custom compiler and then compile this down to SPIR-V and feed it to the Vulkan driver. What we found interesting in compiling this down to SPIR-V was being able to extract binding information directly from the SPIR-V binary blob in order to create compatible pipeline layouts.

 

In GL, a framebuffer doesn’t have a clear beginning and end. The only ways you can tell with GL are SwapBuffers, clearing/discarding, or switching framebuffer objects. The Vulkan API addresses this by having the developer explicitly declare when drawing begins and ends. Our initial implementation mimicked the GL behavior, so that when we either called SwapBuffers or changed framebuffer, we ended the render pass and started a new one. This was obviously a temporary solution and definitely not what we would call best practice, but it helped us get the engine working to the same level as OpenGL with any content, except maybe for performance!

 

Next steps

Having the engine working on the Vulkan API as described above doesn’t give you better performance or the benefit of any of the cool features of Vulkan. In order to improve upon the process described above, we took the following steps to make the engine work more efficiently and to properly utilize Vulkan features.

 

We made the following key decisions:

  • Sacrifice some OpenGL performance to get the best out of Vulkan
  • Implement block based memory allocator
  • Implement new command buffer abstraction
  • Implement new explicit render pass interface
  • Standardize on SPIR-V for all our shaders

 

Sacrifice some OpenGL performance to get the best out of Vulkan

The first step was redesigning the engine backend and, more specifically, the structures and functionality responsible for interfacing with the driver. Unfortunately, once we ported to Vulkan, content running on OpenGL performed slightly worse than previously. This is mainly due to the descriptor set emulation that we implemented in the OpenGL backend: for our OpenGL backend we added support for allowing multiple sets in the shaders. The SPIR-V cross compiler removes the sets in the OpenGL GLSL output and replaces them with a serialized list of binding locations together with a remapping table. The descriptor set emulation in OpenGL then has to rebind and remap textures, samplers, uniform buffers, etc. depending on the remapping table. This means that we get a slight performance hit if we have lots of shaders using wildly different binding tables, because of the API overhead of rebinding.

 

But this is a decision that you will have to make for your graphics engine. We wanted to support descriptor sets at a high level, but this can also be hidden at a lower level in your engine so that you don’t have to think about handling descriptor set remapping in your OpenGL implementation.

 

Block based memory allocator

We added a block based memory allocator where we allowed two types of allocations: “static” and “transient.” Static memory is allocated for either the scene or the engine lifetime. The transient allocation type, on the other hand, is memory which lasts one frame at most. Transient memory is allocated and tracked by our memory manager. We decided to track this memory for the same number of frames as we created in our swap chain. This means that we only need to track entire blocks of memory, which simplifies a lot of things.

We also implemented constant data memory updates using the block based memory allocator but for this we made ring frame allocators that can be pinned to specific threads in order to allow efficient uniform data updates across multiple threads.

Device memory is memory that can be bound to Vulkan objects. You may be familiar with VBOs, PBOs, UBOs and so on, implemented in the OpenGL API. Those were designed to map/unmap memory on the CPU. Once the memory has been mapped, the CPU can read and write the data allocated by the GPU. In Vulkan this is more generic: a memory allocation can back anything, be it a VBO, a PBO or a texture, and it can also be a framebuffer, for instance.
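
To illustrate the idea of the transient/ring frame allocator described above, here is a hypothetical sketch (not our engine's actual code): transient allocations are sub-allocated from one large, persistently mapped device memory block, and the whole block is recycled once the frame that used it has completed.

typedef struct
{
    VkDeviceMemory memory;   /* one large device memory block, persistently mapped */
    uint8_t*       mapped;   /* CPU pointer returned by vkMapMemory                */
    VkDeviceSize   size;     /* total size of the block                            */
    VkDeviceSize   head;     /* current allocation offset within the frame         */
} frame_allocator;

/* Allocate 'bytes' of transient memory, valid for the current frame only.
 * 'align' is assumed to be a power of two. */
static VkDeviceSize frame_alloc(frame_allocator* fa, VkDeviceSize bytes, VkDeviceSize align)
{
    VkDeviceSize offset = (fa->head + align - 1) & ~(align - 1);
    if (offset + bytes > fa->size)
    {
        return (VkDeviceSize)-1;   /* out of transient memory for this frame */
    }
    fa->head = offset + bytes;
    return offset;                 /* caller writes to fa->mapped + offset   */
}

/* Called once the fence of the frame that used this allocator has signalled */
static void frame_reset(frame_allocator* fa)
{
    fa->head = 0;
}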

 

Command buffer abstraction

As Vulkan is capable of recording to command buffers on multiple threads, we made a decision to make explicit abstractions for command buffers. We implemented this as a thin wrapper which pushes calls to the Vulkan command buffer object.

The command buffer abstraction also keeps track of state locally per command buffer, and the wrapper flushes the commands when required. By flushing we do not mean flushing the graphics pipeline as was the case in OpenGL; what we mean is creating any necessary Vulkan state objects. In other words, we group all “draw calls” with similar state, and as long as the draw calls share the same state we put them into one batch. Once a “draw call” arrives with a different state we flush the current state, which means we create new Vulkan objects. Note that we only create new pipeline objects if we don’t already have an equivalent one in the cache.

There are a lot of Vulkan objects that are created on-demand, such as:

  • VkPipeline
  • VkPipelineLayout
  • VkFramebuffer
  • VkRenderPass

 

Some of these (like VkPipeline) need to be created during command buffer recording, which means creation needs to be thread-safe and efficient. In order to keep it running efficiently we reduced lock contention by spreading the caches over a number of threads. This ensured that we could create state objects at runtime without hitting performance bottlenecks.

VkFramebuffer and VkRenderPass objects need to be created depending on the render target to which you’re rendering, so there are dependencies between different Vulkan objects that you need to track and handle correctly. The ultimate solution would be to create all state upfront, as that would mean you could just focus the graphics engine on feeding the GPU with existing data.
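
The kind of on-demand caching described above could look roughly like the following hypothetical sketch (illustrative only, not our engine's code): the complete pipeline state is hashed and used as a key, so an equivalent VkPipeline is only ever created once.

typedef struct
{
    uint64_t   key;        /* hash of the complete pipeline state */
    VkPipeline pipeline;   /* lazily created Vulkan object        */
} pipeline_cache_entry;

static VkPipeline get_or_create_pipeline(pipeline_cache_entry* cache, size_t cache_size,
                                         VkDevice device,
                                         const VkGraphicsPipelineCreateInfo* info,
                                         uint64_t state_hash)
{
    /* Look for an equivalent pipeline first */
    for (size_t i = 0; i < cache_size; ++i)
    {
        if (cache[i].pipeline != VK_NULL_HANDLE && cache[i].key == state_hash)
        {
            return cache[i].pipeline;
        }
    }

    /* Not found: create it on demand. In the real engine this path runs during
     * command buffer recording and therefore has to be thread-safe. */
    VkPipeline pipeline = VK_NULL_HANDLE;
    vkCreateGraphicsPipelines(device, VK_NULL_HANDLE, 1, info, NULL, &pipeline);

    /* Insert into the first free slot (eviction policy omitted) */
    for (size_t i = 0; i < cache_size; ++i)
    {
        if (cache[i].pipeline == VK_NULL_HANDLE)
        {
            cache[i].key      = state_hash;
            cache[i].pipeline = pipeline;
            break;
        }
    }
    return pipeline;
}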

 

Explicit render pass interface

We added an explicit render pass interface. The render pass is one of the most interesting features of the Vulkan API as it basically defines where the rendered content is going to end up. You also explicitly declare what to load and what to store before and after a render pass. It also has more powerful capabilities such as rendering sub-passes. Sub-passes are basically a more vendor-neutral version of Pixel Local Storage: they allow the shader to read attachments written by the previous sub-pass at the same pixel location. You may read more about Pixel Local Storage here: https://community.arm.com/groups/arm-mali-graphics/blog/2014/04/01/pixel-local-storage-on-arm-mali-gpus
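
A minimal sketch of these explicit load/store declarations for a single-subpass render pass with one colour attachment is shown below ('device' and the attachment format are assumptions for illustration):

VkAttachmentDescription colour = { 0 };
colour.format         = VK_FORMAT_R8G8B8A8_UNORM;
colour.samples        = VK_SAMPLE_COUNT_1_BIT;
colour.loadOp         = VK_ATTACHMENT_LOAD_OP_CLEAR;      /* declared: what happens before the pass */
colour.storeOp        = VK_ATTACHMENT_STORE_OP_STORE;     /* declared: what happens after the pass  */
colour.stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
colour.stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
colour.initialLayout  = VK_IMAGE_LAYOUT_UNDEFINED;
colour.finalLayout    = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR;

VkAttachmentReference colour_ref = { 0 };
colour_ref.attachment = 0;
colour_ref.layout     = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;

VkSubpassDescription subpass = { 0 };
subpass.pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS;
subpass.colorAttachmentCount = 1;
subpass.pColorAttachments    = &colour_ref;
/* Further subpasses could read this attachment via pInputAttachments,
 * which is what makes subpasses resemble Pixel Local Storage. */

VkRenderPassCreateInfo rp_info = { 0 };
rp_info.sType           = VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO;
rp_info.attachmentCount = 1;
rp_info.pAttachments    = &colour;
rp_info.subpassCount    = 1;
rp_info.pSubpasses      = &subpass;

VkRenderPass render_pass = VK_NULL_HANDLE;
vkCreateRenderPass(device, &rp_info, NULL, &render_pass);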

 

Standardize on SPIR-V for all our shaders

We found it beneficial to write a library for this, as we can then parse the SPIR-V and retrieve information about input and output bindings. We can then reuse this information for pipeline layout creation. It is also very useful for re-using already existing pipeline layouts which are compatible across different shader buckets.

 

Conclusion

As a result of porting the engine to Vulkan, we have created the Metropolis demonstration which showcases Vulkan’s benefits. The demo is at a very early stage of development. As we implement more features for GDC and then SIGGRAPH, you will see the obvious benefits of the API as we issue 2-3k draw calls per frame while simultaneously producing dynamic cascaded shadows.

There are more features of Vulkan that we will cover in a future blog post:

  • Vulkan as a power saving API
  • Push constants
  • Compute
  • Descriptor set handling
  • Pipeline barriers
  • Handling image transitions

 

 

This demo includes assets which have been created under licence from TurboSquid Inc.

Mali Graphics Debugger V3.4 Released with Vulkan Support


This is a very exciting time for graphics. Today marked the release of Vulkan, the latest graphics API to be developed by the Khronos Group. It has been completely redesigned from the ground up with a much greater emphasis on low-level efficiency. Some of the key features include better multi-threading support, reduced CPU workload and, for the first time, desktop and mobile using the same graphics API. This last feature will make generating content for both mobile and desktop even easier. More information about Vulkan can be found in the community blog ARM Vulkan: Built for Modern Mobile Systems and on the Khronos website.

 

As it is a completely new API we appreciate there could be a learning curve required. Help is at hand though as the Mali Graphics Debugger now has Vulkan support. This means that you will be able to trace your Vulkan applications and see what function calls are being used in your application and in what order. It will also detect end-of-frame events so that you can see what function calls were involved in making each frame.

 

simpleTriangle.png  vulkanMgdUi.png

 

Keep checking the ARM Connected Community as over the coming few months the MGD Engineers will be working hard to implement even more features to make your transition from OpenGL ES to Vulkan easier. We'd also love to hear your thoughts below on how to make Vulkan support in MGD be as good as it can possibly be.

Omate's Mali-based Smartwatch for Kids Gives Peace of Mind Parenting


Last week, at MWC, ARM® partner Omate® announced their first standalone smartwatch designed especially for kids, with their brand-new Omate Wherecom K3. Featuring a 3G Dual-core 1GHz ARM Cortex®-A7 and an ARM Mali™-400 GPU, the K3 achieves the low power and area requirements essential for wearables whilst also delivering all the key features.

omate1.jpg

Designed for six to nine year olds, the K3 is a standalone smartwatch with 3G and Wi-Fi capabilities, a capacitive touchscreen with a child-friendly Wherecom user interface running on top of Android 5.1. In collaboration with security and location specialists Monitorlinq, the K3 incorporates GPS technology with communications features such as 3G and Wi-Fi hotspot usage to triangulate location for greater accuracy. This allows parents better-than-ever location services as well as instant communication options for the ultimate in peace of mind whenever the kids are away from home.

omate2.jpg

Laurent Le Pen, CEO of Omate, explained the reasoning behind Omate’s choice of ARM power for their devices:

 

ARM is an established name in processor IP and it made sense for us to complement the ARM Cortex A-7 with the Mali-400 GPU for the ease of integration between the products as well as unparalleled power saving features. Achieving a high quality user experience in the smallest area and with the best possible battery life was a key consideration for us and the Mali-400 helps us deliver that.

 

Mali has recently been recognised as the highest shipping GPU in the world and the Mali-400 is still ARM’s most licensed model. The combination of optimum power savings and silicon area efficiency of the Mali-400 GPU make it the ideal choice to suit the low power, low cost requirements of the wearables market.

 

The Mali family of GPUs is the most flexible and scalable on the market, ranging from the highly efficient Mali-400 up to the Mali-T880, the high power, high efficiency option for high end smartphones, tablets and DTVs. It’s not all about the GPU however, the combined Mali Multimedia Suite (MMS) of GPUs, video processors and display processors is pre-optimized to work better together and reduce time to market. The latest generation of ultra-low power Mali GPUs is well placed to continue the rise of high end wearables so stay tuned for the appearance of the Mali-470 in devices soon.

 

Launched off the back of a million-dollar-plus Kickstarter campaign back in 2013, Omate now has a whole range of smart wearables. From heavy duty sports watches like the recently released Omate Rise, to fashion-conscious design styles like the feminine Lutetia, and now this addition of kids’ products, Omate really does have something for everyone. They even pioneered companion jewellery pieces, in collaboration with French fashion house Ungaro, which allow you to receive instant vibration alerts when your ‘favourite’ contact gets in touch. With such a fast growing range, and with the Omate Rise funded in just fifteen minutes on Indiegogo, this is a company to watch and one that ARM looks forward to working with in the future.

Virtual Reality – Blatant Latency and how to Avoid it


Welcome back to the third post in my Virtual Reality (VR) blog series. We’ve already discussed the reasoning behind our interest in VR as a major driving force for graphics development, but even in the time between posts, VR uptake has grown tangibly and it’s become apparent that the industry is moving ahead fast. It’s only a matter of time now until VR is totally mass market. Yes, it will probably start with high end gaming but is set to grow so quickly that the time to take notice is now.

 

The what and the why

In blog two we considered the difficulty of focus when developing for VR and some of the ways of managing this. This time around we’re looking at how to develop low latency VR. The latency measurement in question is the time it takes to turn the actual motion of your head into the image you see on the screen of your VR headset. The two events need to be close enough together that you don’t notice any disconnect – just like in the real world. If latency is too high or too variable the immersion feels unnatural and the disparity with your brain’s understanding of normal movement will start to cause nausea or dizziness – not a super fun experience. Industry research tells us that this “motion-to-photons” latency should be consistently less than 20 milliseconds (ms) for a smooth and natural VR experience. This is a tough call at a standard refresh rate of 60Hz, where a single refresh already takes around 16.7ms, but it is attainable with the right approach.

 

The winning combination

There are a few elements which, when combined, can contribute greatly to successfully building a low latency VR system. The first aspect we will consider is front buffer rendering. Double or triple buffering is commonly used in graphics applications, including Android, in order to increase smoothness by allowing the GPU to draw pixels to a rotating set of off-screen buffers, swapping them with the on-screen buffer at the end of each display refresh. This process helps iron out variations between neighbouring frame times but also has the added effect of increasing latency, which is of course the opposite of what we are looking for in a VR application. In front buffer rendering, the GPU is able to bypass the off-screen buffers and render directly to the buffer the display reads from in order to reduce latency. Rendering to the front buffer needs careful synchronisation with the display to ensure the GPU is always writing ahead of the display reading. The context_priority extension available on Mali GPUs is used to enable prompt scheduling of tasks on the GPU, allowing front buffer rendering processes such as Timewarp to take priority over less immediately urgent tasks and improve the user experience.

[Figure: front buffer rendering compared with double buffered rendering]

Another important part of this puzzle is selecting the right type of display for your VR device. Organic Light Emitting Diode (OLED) displays can further improve a VR experience, and they work differently from the familiar and well-established LCD. Each pixel in an OLED display provides its own light source, driven by the Thin Film Transistor array that sits behind it, as opposed to the white LED backlight of an LCD. The brightness of an OLED display is set by the current driven through the organic film. Because colours are managed by individually varying the tiny red, green and blue subpixels, it is possible to get brighter, sharper hues with stronger saturation. Sections of the panel can be turned off entirely, achieving a deeper, truer black than is typically possible on an LCD, which has to block out its backlight. This is the usual selling point for OLED panels, but critically for VR it also offers an easier means of achieving low persistence through partial illumination. A full persistence display is lit continuously, which means the scene view is correct only briefly and then very quickly out of date. A low persistence approach keeps the image lit only while the view is correct and then goes dark, a process which is imperceptible at a high refresh rate and provides the illusion of a continuous image.

[Figure: full persistence versus low persistence screens]

This is important for reducing blur for the user. The additional flexibility with which you can illuminate the pixels in an OLED panel means that the display can show multiple partial images during a single refresh, reacting mid-frame to the changes fed to it by the sensors in the headset and allowing the head location to change while the scan is moving across the screen. This is not possible with an LCD panel without replacing the global backlight. The ability to drive an OLED panel by drawing to the front buffer in sections or slices, through a Timewarp-like process, is key to achieving a lower latency VR experience, because the image you see on screen can adapt to your head movements much more quickly than is otherwise possible.
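Purely as an illustration of the slicing idea, the sketch below paces four per-slice updates across one refresh period. The slice count, the time-based pacing and the draw_slice() helper are assumptions made for this example rather than a real Mali or Android interface; a production system would synchronise against the display’s vsync and fence signals rather than wall-clock arithmetic.

```c
/* Illustrative sketch of slice-based ("beam racing") front buffer updates.
   Timing scheme and helpers are hypothetical, for explanation only. */
#include <stdint.h>
#include <time.h>

#define NUM_SLICES 4
#define REFRESH_HZ 60.0
#define FRAME_NS   ((int64_t)(1e9 / REFRESH_HZ))
#define SLICE_NS   (FRAME_NS / NUM_SLICES)

static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
}

/* Hypothetical per-slice renderer: re-sample the latest head pose, warp the
   corresponding horizontal band and write it straight to the front buffer.
   The actual GPU work is elided in this sketch. */
static void draw_slice(int slice_index)
{
    (void)slice_index;
}

/* Draw each slice just before the scan-out reaches it, so the GPU always
   writes ahead of the display read. */
static void race_the_beam(int64_t vsync_ns)
{
    for (int s = 0; s < NUM_SLICES; ++s) {
        int64_t draw_start = vsync_ns + (int64_t)(s - 1) * SLICE_NS;
        while (now_ns() < draw_start) {
            /* In a real system we would sleep or wait on a fence here. */
        }
        draw_slice(s);
    }
}

int main(void)
{
    /* Pretend the last vsync happened right now and race one frame. */
    race_the_beam(now_ns());
    return 0;
}
```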

[Figure: low persistence display updated slice by slice within a refresh]

Why warp?

Now we come to one of the keystones of this combination: the Timewarp process. Because the scene in an immersive VR application changes comparatively gradually, the image changes between views by a small and therefore relatively predictable amount. Warping is essentially shifting an image rendered at an older head location to match a newer one. This partially decouples the application frame rate from the refresh rate and allows the system to enforce a latency guarantee that some applications may not provide on their own. The shifting can account for changes in head rotation but not for head position or scene animation, so it is something of an approximation, but it provides an effective safety net and enables an application running at 30 FPS to appear, at least in part, as if it is tracking the user’s head at 60 FPS or above.
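As a rough illustration of the rotational correction involved, the sketch below computes the delta rotation between the head orientation a frame was rendered with and the latest orientation reported by the sensors, assuming both are tracked as unit quaternions. A real Timewarp pass would convert this delta into a matrix or homography and apply it while resampling the rendered image, which is beyond the scope of this snippet.

```c
/* Sketch of the rotation-only part of Timewarp using unit quaternions. */
#include <stdio.h>

typedef struct { float w, x, y, z; } quat;

/* Inverse of a unit quaternion. */
static quat quat_conjugate(quat q)
{
    quat r = { q.w, -q.x, -q.y, -q.z };
    return r;
}

/* Hamilton product a * b. */
static quat quat_mul(quat a, quat b)
{
    quat r = {
        a.w*b.w - a.x*b.x - a.y*b.y - a.z*b.z,
        a.w*b.x + a.x*b.w + a.y*b.z - a.z*b.y,
        a.w*b.y - a.x*b.z + a.y*b.w + a.z*b.x,
        a.w*b.z + a.x*b.y - a.y*b.x + a.z*b.w
    };
    return r;
}

/* Rotation that takes the view rendered at 'render_pose' to 'latest_pose'. */
quat timewarp_delta(quat render_pose, quat latest_pose)
{
    return quat_mul(latest_pose, quat_conjugate(render_pose));
}

int main(void)
{
    quat rendered = { 1.0f,    0.0f, 0.0f,    0.0f }; /* looking straight ahead */
    quat latest   = { 0.9962f, 0.0f, 0.0872f, 0.0f }; /* ~10 degrees of yaw     */
    quat delta    = timewarp_delta(rendered, latest);
    printf("warp delta: w=%.4f x=%.4f y=%.4f z=%.4f\n",
           delta.w, delta.x, delta.y, delta.z);
    return 0;
}
```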

[Figure: warping an image rendered at an older head pose to a newer one]

Tricks of the trade

In this post we've discussed the tight integration that needs to exist between GPU and display, but this is only one part of the stack. Once we want to play videos, perhaps DRM-protected ones, and integrate system notifications, the complexity grows further. High quality VR support requires that your multimedia components are well synchronised and communicate in bandwidth-efficient ways, in order to provide not only the best experience to the end user but also the best power efficiency and performance. The ARM® Mali™ Multimedia Suite (MMS) of GPU, video and display processors is integrated and supported by technologies such as ARM Frame Buffer Compression (AFBC) and ARM TrustZone®, making it a leader in VR development technology.

 

Join us at GDC to find out more!
