Channel: ARM Mali Graphics

Epic Giveaway 2014: Start the New Year with a Samsung Galaxy Note 4


[Image: Epic Giveaway.png]


[Image: samsung note 4.jpg]

The Samsung Galaxy Note 4 has quickly made itself popular among the ARM Mali graphics team. And it is not just its 515PPI Quad HD Super AMOLED display, vivid colors and intuitive UI that have earned it a place in our hearts – it is, as you would expect from ARM engineers, its stunning processor that has caught our eye.

[Image: Exynos 7 Octa.png]

The Samsung Exynos 7 Octa is the latest mobile application processor to come out of the Samsung LSI team and it boasts considerable improvements over the previous generation – including up to 74% higher graphics performance with the ARM Mali-T760 MP6 GPU. With this boost the Samsung Galaxy Note 4 can deliver superior, more life-like 3D gaming experiences on super HD screens, as well as a smoother, more responsive user interface and performance-intensive, up-and-coming applications such as instant image stabilization, video editing or facial recognition. The Mali-T760 incorporates many of ARM's popular energy-efficient technologies, such as ARM Frame Buffer Compression, Adaptive Scalable Texture Compression, Smart Composition and Transaction Elimination. Together with micro-architectural improvements to the Mali-T760 – in particular to the L2 cache interconnect – these result in an Exynos SoC that delivers a fantastic graphics experience without overexerting its intrinsic thermal and power budget.

[Image: samsung exynos 7.png]


The GPU is paired with a 1.9GHz ARM Cortex-A57 MP4 and a 1.3GHz Cortex-A53 MP4 processor in big.LITTLE™ configuration with Samsung's HMP (Heterogeneous Multi-Processing) solution, so every process can make intelligent use of the available processing power: whatever the multitasking load and whatever application is being run, there are no lags and no alarming power consumption. In all, the HMP technology, used with the Cortex-A57 and Cortex-A53 cores, provides a 57% CPU performance increase over the previous-generation Exynos 5 Octa.

 

The sum of all this is a device that has not only impressed across a range of benchmarks but also delighted critics and the public at large. The Samsung Galaxy Note 4 is an extremely desirable device that delivers the very latest advances in mobile technology – and it can be yours! ARM is giving away a Samsung Galaxy Note 4 as part of the 2014 Epic Giveaway in partnership with HEXUS. To find out more and to enter the EPIC Giveaway for your chance to win, click here and go to HEXUS’ Facebook page.


 

The 2014 Epic Giveaway is underway today. In partnership with HEXUS, ARM is giving you the chance to win amazing new prizes this holiday season! Every day for the next few weeks, we'll be giving away a brand-new ARM-based device. We'll have an array of prizes from ARM and our partners, including Atmel, Nvidia and Samsung, plus many, many more! Each prize draw will be open for seven days, so visit the dedicated competition page to keep tabs on what's up for grabs and what's coming soon.


Using DS-5 Streamline to Optimize Complex OpenCL™ Applications on Mali GPUs


Heterogeneous applications – those running code on multiple processors like a CPU and a GPU at the same time – are inherently difficult to optimize.  Not only do you need to consider how optimally the different parts of code that run on the different processors are performing, but you also need to take into account how well they are interacting with each other.  Is either processor waiting around unnecessarily for the other?  Are you copying large amounts of memory unnecessarily?  What level of utilisation are you making of the GPU?  Where are the bottlenecks?  The complexities of understanding all these are not for the squeamish.

 

Performance analysis tools are, of course, the answer, at least in part.  DS-5 Streamline performance analyzer is one of these tools and recently saw the addition of some interesting new features targeting OpenCL.  Streamline is one of the components of ARM DS-5 Development Studio, the end-to-end suite of tools for software development on any ARM processor.

 

So, armed with DS-5 Streamline and a complex, heterogeneous application how should you go about optimization?  In this blog I aim to give you a starting point, introducing the DS-5 tool and a few concepts about optimization along the way.

 

DS-5 Streamline Overview


[Image: pic1.jpg]


DS-5 Streamline allows you to attach to a live device and retrieve hardware counters in real time.  The counters you choose are displayed in a timeline, and this can include values from both the CPU and GPU in the same trace.  The image above, for example, shows a timeline with a number of traces.  From the top there’s the dual-core CPU activity in green, the GPU’s graphics activity in light blue and the GPU’s compute activity in red.  Following that are various hardware counter and other system traces.

 

As well as the timeline, on the CPU side you can drill down to the process you want to analyse and then profile performance within the various parts of the application, right down to system calls.  With Mali GPUs you can specify performance counters and graph them right alongside the CPU.  This lets you profile both graphics and OpenCL compute jobs, allowing for highly detailed analysis of the processing being done in the cores and their components. A recently added feature, the OpenCL timeline, takes this a step further, making it possible to analyse individual kernels amongst a chain of kernels.

 

Optimization Workflow


So with the basics described, what is the typical optimization process for complex heterogeneous applications?

 

When the intention is to create a combined CPU and GPU solution for a piece of software you might typically start with a CPU-only implementation.  This gets the gremlins out of the algorithms you need to implement and then acts both as a golden reference for the accuracy of computations being performed, and as a performance reference so you know the level of benefit the move to multiple processor types is giving you.

 

Often the next step is then to create a “naïve” port.  This is where the transition of code from CPU to GPU is functional but relatively crude. You wouldn’t necessarily expect a big – or indeed any – uplift in performance at this stage, but it’s important to establish a working heterogeneous model if nothing else.

 

At this point you would typically start thinking about optimization.  Profiling the naïve port is probably a good next step as this can often highlight the level of utilisation within your application and from there you can deduce where to concentrate most of your efforts.  Often what you’re looking for at this stage is a hint as to the best way to implement the parallel components of your algorithm.

 

Of course, to get the very best out of the hardware you're using, it is vital to have at least a basic understanding of the architecture you are targeting.  So let's start with a bit of architectural background for the Mali GPU.

 

The OpenCL Execution Model on Mali GPUs


Firstly, here’s how the OpenCL execution model maps onto Mali GPUs.

 

[Image: pic2.jpg]

 

Work items are simply threads on the shader pipeline, each one with its own registers, program counter, stack pointer and private stack. Up to 256 of these can run on a core at a time, each capable of natively processing vector data.

 

OpenCL workgroups – collections of work items – also execute on an individual core. Workgroups can use barriers, local atomics and cached local memory.

 

The ND range – the entire workload for an OpenCL job – is split into workgroups, which are distributed across the available Mali GPU cores. Global atomics are supported, and global memory is cached.

 

As we'll see, compared with some other GPU architectures, Mali GPU cores are relatively sophisticated devices, capable of handling hundreds of threads in flight at any one time.

 

The Mali GPU Core

 

Let’s take a closer look inside one of these cores:

 

[Image: pic3.jpg]

 

Here we see the dual ALU, the load/store and the texture pipelines. Threads come in at the top and enter one of these pipes, circle round back up to the top for the next instruction until the thread completes, at which point it exits at the bottom.  We would typically have a great many threads running this way spinning around the pipelines instruction by instruction.

 

Load/Store

 

So let’s imagine the first instruction is a load.  It enters and is executed in the load/store pipe.  If the data is available, the thread can loop round on the next cycle for the next instruction.  If the data hasn’t yet arrived from main memory, the instruction will have to wait in the pipe until it’s available.

 

ALUs

 

Imagine then that the next instruction is arithmetic.  The thread now enters one of the arithmetic pipes.  ALU instructions support SIMD – single instruction, multiple data – allowing operations on several components at a time.  The instruction format itself is VLIW – very long instruction word – supporting several operations per instruction.  This could include, for example, a vector add, a vector multiply and various scalar operations all in one instruction.  This can make certain operations appear to be "free", because the arithmetic units within the ALU can perform many of them in parallel within a single cycle.  Finally, there is a built-in function library – the "BIFL" – which provides hardware acceleration for many mathematical and other operations.

 

So this is a complex and capable core, designed to keep many threads in flight at a time, and thereby hide latency.  Latency hiding is what this is ultimately all about. We don’t care if an individual thread has to wait around for some data to arrive as long as the pipelines can get on with processing other threads.

 

Each of these pipelines is independent from the others, and likewise the threads are entirely independent from other threads.  The total time for a program to be executed is then defined by the pipeline that needs the most cycles to let every thread execute all the instructions in its program. If we have predominantly load/store operations, for example, the load/store pipe will become the limiting factor.  So in order to optimize a program we need to find out which pipeline this is, allowing us to target optimization efforts effectively.

 

Hardware Counters

 

To help determine this we need to access the GPU’s hardware counters. These will identify which parts of the cores are being exercised by a particular job.  In turn this helps target our efforts towards tackling bottlenecks in performance.

 

There are a large number of these hardware counters available. For example there are counters for each core and counters for individual components within a core, allowing you to peek inside and understand what is going on with the pipelines themselves.  And we have counters for the GPU as a whole, including things like the number of active cycles.

 

Accessing these counters is where we come back to DS-5 Streamline.  Let’s look at some screenshots of Streamline at work.

 

[Image: pic4.jpg]

The first thing to stress is that what we see here is a whole-system view.  The vertical green bars in the top line show the CPU, the blue bars below that show the graphics part of the application running on the GPU, and the red bars show the compute-specific parts of the application on the GPU.

 

[Image: crop.jpg]

 

There are all sorts of ways to customise this – I’m not going to go into huge amounts of detail here, but you can select from a wide variety of counter information for your system depending on what it is you need to measure. Streamline allows you to isolate counters against specific applications for both CPU and GPU, allowing you to focus in on what you need to see.

[Image: crop2.jpg]

 

Looking down the screen you can see an L2 cache measurement – the blue wavy trace in the middle – and further down we've got a counter showing activity in the Mali GPU's arithmetic pipelines.  We could scroll down to find more, and indeed zoom in for a more detailed view at any point.

 

DS-5 Streamline can often show you very quickly where the problem lies in a particular application.  The next image was taken from a computer vision application running on the CPU and using OpenCL on the GPU.  It would run fine for a number of seconds and then, seemingly at random, would suddenly slow down significantly, the processing framerate dropping by half.

[Image: crop3.jpg]

 

You can see the trace has captured the moment this slowdown happened. To the left of the timeline marker we can see the CPU and GPU working reasonably efficiently.  Then this suddenly lengthens out, we see a much bigger gap between the pockets of GPU work, and the CPU activity has grown significantly.  The red bars in amongst the green bars at the top represent increased system activity on the platform.  This trace and others like it were invaluable in showing that the initial problem with this application lay with how it was streaming and processing video.

 

One of the benefits of having the whole system on view is that we get a holistic picture of the performance of the application across multiple processors and processor types, and this was particularly useful in this example.

 

[Image: crop4.jpg]

Here we’ve scrolled down the available counters in the timeline to show some others – in particular the various activities within the Mali GPU’s cores.  You can see counter lines for a number of things, but in particular the arithmetic, load-store and texture pipes – along with cache hits, misses etc.  Hovering over any of these graphs at any point in the timeline will show actual counter numbers.

 

[Image: crop5.jpg]

 

Here for example we can see the load/store pipe instruction issues at the top, and actual instructions on the bottom.  The difference in this case is a measure of the load/store re-issues necessary at this point in the timeline – in itself a measure of efficiency of memory accesses.  What we are seeing at this point represents a reasonably healthy position in this regard.

 

The next trace is from the same application we were looking at a little earlier, but this time with a more complex OpenCL filter chain enabled.

 

[Image: crop6.jpg]

 

If we look a little closer we can see how efficiently the application is running.  We’ve expanded the CPU trace – the green bars at the top – to show both the cores we had on this platform.  Remember the graphics elements are the blue bars, with the image processing filters represented by the red.

 

[Image: mag.jpg]

 

Looking at the cycle the application is going through for each frame:

 

  1. Firstly there is CPU activity leading up to the compute job.
  2. Whilst the compute job then runs, the CPU is more or less idle.
  3. With the completion of the compute filters, the CPU does a small amount of processing, setting up the graphics render.
  4. The graphics job then runs, rendering the frame before the sequence starts again.

 

So in a snapshot we have this holistic and heterogeneous overview of the application and how it is running.  Clearly we could aim for much better performance here by pipelining the workload to avoid the idle gaps we see.  There is no reason why the CPU and GPU couldn’t be made to run more efficiently in parallel, and this trace shows that clearly.

 

OpenCL Timeline

 

There are many features of DS-5 Streamline, and I’m not going to attempt to go into them all.  But there’s one in particular I’d like to show you that links the latest Mali GPU driver release to the latest version of DS-5 (v5.20), and that’s the OpenCL Timeline.

 

[Image: pic1.jpg]

 

In this image we’ve just enabled the feature – it’s the horizontal area at the bottom.  This shows the running of individual OpenCL kernels, the time they take to run, any overhead of sync-points between CPU and GPU etc.

 

[Image: crop7.jpg]

 

Here we have the name of each kernel being run, along with the supporting host-side setup processes.  If we hover over any part of this timeline…

 

[Image: crop8.jpg]

… we can see details about the individual time taken for that kernel or operation.  In terms of knowing how then to target optimizations, this is invaluable.

 

Here’s another view of the same feature.

 

[Image: pic15.jpg]

 

We can click the “Show all dependencies” button and Streamline will show us visually how the kernels are interrelated.  Again, this is all within the timeline, fitting right in with this holistic view of the system.  Being able to do this – particularly for complex, multi-kernel OpenCL applications – is highly valuable to developers, helping them to understand and improve the performance of ever-more demanding applications.

 

Optimizing Memory Accesses

 

So once you have these hardware counters, what sort of use should you make of them?

 

Generally speaking, the first thing to focus on is the use of memories. The SoC has only one programmer-controlled memory in the system – in other words, there is no local memory; it's all just global.  The CPU and GPU have the same visibility of this memory and often they'll share a memory bus, so overlapping memory accesses can cause problems.

 

If we want to shift data back and forth between CPU and GPU, we don't need to copy memory (as you might on a desktop architecture).  Instead, we only need to do cache flushes.  These also take time and need minimising. With Streamline we can take an overview of the program, allowing us to see when the CPU was running and when the GPU was running, much like the timelines we saw earlier.  We may want to optimize our synchronisation points so that the GPU or CPU are not waiting any longer than they need to. Streamline is very good at visualising this.

 

Optimizing GPU ALU Load

 

With memory accesses optimized, the next stage is to look more closely at the execution of your kernels.  As we’ve seen, using Streamline we can zoom into the execution of a kernel and determine what the individual pipelines are doing, and in particular determine which pipeline is the limiting factor.  The Holy Grail here – a measure of peak optimization – is for the limiting pipe to be issuing instructions every cycle.

 

I mentioned earlier that we have a latency-tolerant architecture because we expect to have a great many threads in the system at any one time. Pressure on register usage, however, will limit the number of threads that can be active at a time, and this can introduce latency issues once the number of threads falls sufficiently: if each thread uses too many registers, there are not enough registers for as many threads in total, and this manifests itself as too few instructions being issued in the limiting pipe.  If we're using too many registers there will also be spilling of values back to main memory, so we'll see additional load/store operations as a result.  The compiler manages all this, but there can be performance implications in doing so.

 

Excessive register usage will also reduce the maximum local workgroup size we can use.

 

The solution is to use fewer registers.  We can use smaller types where possible – switching from 32-bit to 16-bit values, for example – or we can split the kernel into multiple kernels, each with a reduced number of registers.  We have seen very large kernels which performed poorly but which, when split into two or more kernels, performed much better overall, because each individual kernel needs fewer registers.  This allows more threads to run at the same time, and consequently gives more tolerance to latency.

 

Optimizing Cache Usage

 

Finally, we look at cache usage.  If this is working badly we would see many L/S instructions spinning around the L/S pipe waiting for the data they have requested to be returned. This involves re-issuing instructions until the data is available.  There are GPU hardware counters that show just what we need, and DS-5 can expose them for us.

 

This has only been a brief look at the world of compute optimization with Mali GPUs.  There’s a lot more out there.  To get you going I’ve included some links below to malideveloper.arm.com for all sorts of useful guides, developer videos, papers and more.

 

Download DS-5 Streamline: ARM DS-5 Streamline - Mali Developer Center

Mali-T600 Series GPU OpenCL Developer Guide - Mali Developer Center

GPU Compute, OpenCL and RenderScript Tutorials: http://malideveloper.arm.com/develop-for-mali/opencl-renderscript-tutorials/

No Such Thing as Free Performance


I spent much of my teenage years sat in front of a monitor with a keyboard and mouse, blasting away friends on the other side of town in a bit of first-person shooter action. The rest of the time I would be thinking about what graphics card I would need to run the same game at a slightly higher resolution or with more effects enabled. If I had two cards, would that double the frame rate, giving me some kind of edge on my friends? The arms race in discrete graphics cards was always about delivering the ultimate performance no matter the cost – and ultimately the only cost was to the consumer's wallet.

 

Coming back to today's world, a significant amount of gaming is played on mobile devices. In the mobile GPU space, it's very easy to be drawn into a similar battle for ultimate performance when comparing GPUs. But here there is one cost that is critical: power or, more specifically, thermal limits. The GPU will always keep giving performance, but if you cannot sustain that performance due to the thermal constraints of a mobile platform, there is little point in having it available. Not only that, but you also want to be able to use your phone to make a call, chat with friends, check e-mails and so on after a few hours of heavy gaming, without having to worry about your battery running low.


When ARM talks about graphics performance, we specifically use the term energy efficiency: delivering the maximum performance within this constrained thermal budget. It's worth pointing out that the "constrained" thermal budget never increases (~2.5W for total SoC power in a high-end smartphone, which also has to cover other components such as the CPU, memory etc), so the only way we can keep up with the performance requirements of the latest content is to keep the Mali GPU architecture constantly evolving with new, innovative technologies and optimizations.

 

[Image: pic1.png]

 

Looking at the latest high-end GPU from ARM, the ARM® Mali™-T860, we improved energy efficiency by 45% compared to the Mali-T628 across a wide range of content. That means it is able to deliver 45% more performance within the same thermal budget. The comparison is core for core in the same process node. In reality, as the industry moves forward with process nodes, we see even greater improvements in energy efficiency in end devices.

[Image: pic2.png]

 

From generation to generation the Mali Midgard GPU family has made step improvements in energy efficiency. These have come both from innovative bandwidth reduction technologies such as AFBC (ARM Frame Buffer Compression) or Transaction Elimination (Jakub Lamik's recent blog titled Should Have Gone to Bandwidth Savers covers these technologies and more in extensive detail) and micro architectural optimizations designed around the content we run every day.

 

[Image: pic3.png]

Looking at the recently launched Mali-T860 GPU, ARM focused its hardware improvements on real-life use cases such as high-end gaming, casual gaming and the user interface. Optimizations like quad-prioritization result in significant efficiency improvements for casual gaming and the user interface. Given that users spend a large proportion of their time playing these types of games or navigating between applications on a device, we feel it is extremely important to focus on such use cases and ensure we are able to handle them in an energy-efficient way. Ultimately, the user gets a smoother experience for longer.

 

Another optimization introduced in the Mali-T760 and enhanced in the Mali-T860 is Forward Pixel Kill. This feature reduces the amount of redundant processing the Mali GPU has to do when pixels are occluded. This is especially effective in applications that use inefficient draw ordering.

 

In summary, when comparing GPUs in our industry, performance alone is not a useful metric when energy efficiency is not included in the mix. Mali GPUs have been designed from the ground up to be extremely energy efficient not only within the GPU itself but also from a system wide perspective. We will continue to innovate in this area for each new generation of Mali GPU products.

Research papers discussing ASTC on ARM Mali GPUs


Pavel Krajcevski and Dinesh Manocha, over at the University of North Carolina at Chapel Hill, have produced several papers recently that discuss ASTC. They had a paper at High Performance Graphics 2014, SegTC: Fast Texture Compression using Image Segmentation, in which they start with a good introduction and review of the state of the art of compressed textures, and in particular methods of actually compressing those textures to be used later on GPUs. Needless to say, they mention ASTC as a significant advance over existing methods, which made us rather proud. They then go on to discuss new methods of texture compression: first computing a segmentation of the image into superpixels to identify homogeneous areas based on a given metric, then using that to define partitionings for partition-based compression formats, including ASTC.

 

Their most recent paper has been accepted for the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games. In this paper, with the catchy title of Compressed Coverage Masks for Path Rendering on Mobile GPUs, they look at methods of accelerating resolution-independent curve rendering on mobile GPUs, preferably in real time. They find ASTC to be a significant advance in this area, seeing good speed-ups overall from using the compressed coverage masks, but also much bigger memory footprint (and bandwidth, and thus power) savings compared to older formats such as ETC2. For those interested, the list of other papers is here.

 

Both papers are highly readable, and I encourage you to have a look. It's clear there is more work to be done in this area, particularly research into more efficient ways of compressing images (and other data) into ASTC, and we look forward to seeing it.

Rockchip RK3288 Solution

Premium Experiences on Mobile


Though they may be reluctant to acknowledge it openly, I think my three kids are quite fortunate. They are growing up in an environment where gaming is ubiquitous – we have two consoles, each child has access to their own tablet, and there are a number of PCs dotted around as well. There are even board games for when they finally tire of looking at a screen. This offers an amazing range of opportunities, as well as an interesting case study of the preferences of the next generation.

[Image: Premium Launch 4.png]

It is clear that there is a natural hierarchy of preferences among my children, with mobile at the top. Console gaming goes in cycles depending on whether a new game is out, and the PC is still the dominant platform for Minecraft, but the kids have an attachment to their tablets that goes beyond the console or PC experience. I think this has a lot to do with a sense of ownership, and the feeling of holding a premium device in your hands. There is a unique tactile experience when playing on these devices that remains exciting.

 

(I should point out here that my youngest daughter is only 4, so has to share iPad time with her mum. However, she did manage to customise the home screen by neatly putting all of her games into a sub-folder, which was both impressive for a 4 year old and disconcerting for her mum, who thought she had deleted Candy Crush.)

 

For me this underlines the first part of the premium mobile experience – the raw quality of the hardware. Even if the game you are playing is quite simple, or you are just consuming content, the feel of a premium device in your hand that is light, sleek and looks cool, remains a genuine experience.

 

The second aspect of the premium experience is the content itself. This has been slower in arriving, as many developers are reluctant to make games that only run on the latest high-end devices. But this is starting to change as more console developers target mobile. These days a high-end console game is an enormously complex beast to put together. This is why many developers choose to license in technology to get their game made – few can afford to invest in their own technology across the board as well as employ the talented art teams needed to produce a top-end game. This level of complexity makes it very difficult to strip back a game so it will run on an older mobile device. The consequence is that few developers have simultaneously targeted console and mobile. Most have opted to make a console game and then hand the task of a mobile port to a different studio, with some mixed results.

 

We are now approaching a point where this situation is about to change, and the driver for this change is the rapid progress being made in mobile hardware. The latest mobile devices promise multi-core performance, fast shared memory and a powerful GPU. You can see this in the stunning specifications for ARM's latest Cortex-A72 CPU and Mali-T880 GPU designs. Together these represent another significant quality step for mobile, and in terms of performance and architecture they closely resemble the latest consoles – the PlayStation 4 and Xbox One. The arrival of desktop-style graphics APIs on mobile is also making it easier for developers to target both platforms simultaneously.

 

There is also convergence in the other direction. Mobile devices are designed with connectivity in mind from the ground up. Without connectivity my tablet defaults back to being just an MP3 player. This has not always been the case with consoles, but the new generation is also built around connectivity and content sharing, often with second-screen functionality built in.

 

These trends are all helping to create an environment where developers can simultaneously target console and premium mobile devices with the same game, which is a very attractive proposition. For developers and publishers it offers a route out of the current mindset of free-to-play with in-game purchases, pay-to-win and irritating adverts. For me these do not define a premium experience. But if your game is targeted at those who are attracted to the latest premium devices, I think you have an audience ready to pay a sensible amount for a quality game that does not constantly ask the player for more money.

 

The key final step in making this vision a reality is having the technology to simultaneously deploy the same content on console and premium mobile. The main engine vendors, Epic and Unity, are already at this point, as are some of the main technology providers. Geomerics, now an ARM company, originally developed the real-time global illumination technology Enlighten for the PC and console space. Enlighten has been mobile ready for over a year now, and on the latest mobile devices is able to run with the same quality settings used on the new generation of consoles.

 

The possibilities for the next five years are spectacular, which brings me back to my first point. The natural adopters of premium mobile content are the current generation of kids growing up with these devices, and their minds are already set – they will continue to want the latest, fastest mobile platform which can play the best games out there.

When Parallelism Gets Tricky: Accelerating Floyd-Steinberg on the Mali GPU


Embarrassingly Parallel


In the world of parallel computing, when an algorithm can easily be split into multiple parallel jobs, where the output of each job doesn’t depend on the output of any other job, it is referred to as “Embarrassingly Parallel” or “Pleasingly Parallel”, whichever you prefer.  This uncharacteristically colourful terminology perhaps reflects the huge relief such algorithms bring to the weary parallel compute developer, who otherwise has to craft delicate inter-thread communication so that parallel jobs can share their results in whatever order the algorithm defines as correct.

 

Let me give you a simple example of such a parallel-friendly algorithm.  Convolution filters are certainly members of the embarrassingly parallel club.  Imagine we have a large array of values:

 

conv.png

An example of a convolution filter.  Each transformed pixel value is created by multiplying its current value and the values of the pixels around it against a matrix of coefficients


Each pixel in the image is processed by summing a proportion of its original value with a proportion of the original values of the surrounding pixels.  The proportion of each pixel usually depends on its proximity to the central pixel being processed.  Crucially – and apparently embarrassingly – none of the calculations require knowledge of the result of any of the other calculations.  This makes parallelizing much easier, because each of the pixel calculations can be performed in any order.  Using a parallel compute API like OpenCL, it is then easy to assign each pixel to a work item – or thread – and watch your convolution job split itself across the processing cores you have at your disposal.
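To make this concrete, here is a minimal serial C sketch of such a filter (the function name `convolve3x3` and the fixed-point kernel layout are my own, not from any ARM library). Each output pixel reads only from the source buffer, so every iteration of the two outer loops could become an independent OpenCL work item.

```c
#include <stdint.h>

/* Apply a 3x3 convolution kernel `k` (fixed-point weights summing to `div`)
   to the interior pixels of a w-by-h greyscale image. No iteration reads
   the output of any other - the "embarrassingly parallel" property. */
void convolve3x3(const uint8_t *src, uint8_t *dst, int w, int h,
                 const int k[9], int div)
{
    for (int y = 1; y < h - 1; y++)
        for (int x = 1; x < w - 1; x++) {
            int sum = 0;
            for (int ky = -1; ky <= 1; ky++)
                for (int kx = -1; kx <= 1; kx++)
                    sum += src[(y + ky) * w + (x + kx)] * k[(ky + 1) * 3 + (kx + 1)];
            int v = sum / div;
            dst[y * w + x] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
        }
}
```

With a Gaussian-style kernel such as {1,2,1, 2,4,2, 1,2,1} and div = 16, a uniform input image passes through unchanged, which makes a handy sanity check.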

 

This sort of example is a nice way to showcase parallel programming.  It gets across the message of splitting your job into the smallest processing elements without getting bogged down in too many thorny issues like thread synchronization.  But what of those problematic – non-embarrassing – algorithms?  How should we tackle those?

 

Well of course, there’s not one answer.  Life is not that simple.  So we need to resort to an example to showcase the sort of methods at your disposal.

 

A good one I came across the other day was the Floyd-Steinberg algorithm.  This is the name given to an image dithering algorithm invented by Robert W Floyd and Louis Steinberg in 1976.  It is typically used when you need to reduce the number of colours in an image and still retain a reasonable perception of the relative colour and brightness levels.  This is achieved through pixel dithering.  In other words, an approximation of the required colour in each area of the image is achieved with a pattern of pixels. The result becomes a trade-off: what you lose is the overall level of detail, but what you gain is a reasonable colour representation of the original image.

 

Here's an example:

colour fs.png

Original image on the left.  On the right the 256-colour dithered version.


In our particular example, we’re going to be looking at converting a grey-scale image – where each pixel can be represented by 256 different levels of grey – to an image only using black and white pixels.

 

bw fs.png

Grey scale version on the left.  2-colour Floyd-Steinberg version on the right

 

What you can see in this example – and what Floyd and Steinberg discovered – is this concept of error diffusion, where an algorithm could determine the distribution of pixels from a limited palette to achieve an approximation of the original image.

 

The algorithm itself is actually quite simple,  and indeed rather elegant.  What you have are three buffers:

  • The original image
  • An error diffusion buffer
  • The destination image


The algorithm defines a method of traversing over an image and for each pixel determining a quantization error – effectively the difference between the pixel’s value and what would be the nearest suitable colour from the available palette.  This determination is made by reference to both the pixel’s current colour and a value read from the error buffer – as written out by previous pixel calculations.  And indeed a proportion of the error calculated for this pixel will then be propagated to neighbouring ones. Here’s how this works:

step1.png

Step 1: a pixel from the source and a value from the error diffusion buffer are added.  Depending on the result, a white or black pixel is written to the destination and an error value is determined.


step2.png


Step 2: the value of err is split up and distributed back into the error distribution buffer into four neighbouring pixels.


The code for doing all this is actually quite simple:


for each y from top to bottom
    for each x from left to right
        val := pixel[x][y] + error_values[x][y]
        if (val > THRESHOLD)
            diff := val - THRESHOLD
            dest[x][y] := 0xff                 // Write a white pixel
        else
            diff := val
            dest[x][y] := 0x0                  // Write a black pixel
        error_values[x + 1][y    ]    += (diff * 7) / 16
        error_values[x - 1][y + 1]    += (diff * 3) / 16
        error_values[x    ][y + 1]    += (diff * 5) / 16
        error_values[x + 1][y + 1]    += (diff * 1) / 16

 

This uses these three buffers:

  • pixel (the source grey-scale image, 1 byte per pixel)
  • dest (the destination black or white image, 1 byte per pixel)
  • error_values (the buffer used to hold the distributed error values along the way - 1 byte per pixel and initialised to all-zeros before starting).


The value of THRESHOLD would typically be set to 128.
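The pseudocode translates almost directly into serial C. The sketch below is illustrative only (the function name and buffer layout are mine): it skips the first and last columns and the last row so the error writes stay within bounds, which the pseudocode implicitly assumes.

```c
#include <stdint.h>
#include <stdlib.h>

#define THRESHOLD 128

/* Serial Floyd-Steinberg: 8-bit greyscale in `pixel`, black/white result
   in `dest` (0x00 or 0xff), both width * height bytes. */
void floyd_steinberg(const uint8_t *pixel, uint8_t *dest, int width, int height)
{
    /* Error diffusion buffer: one int per pixel, initialised to zero. */
    int *err = calloc((size_t)width * height, sizeof *err);

    for (int y = 0; y < height - 1; y++)       /* last row skipped: its    */
        for (int x = 1; x < width - 1; x++) {  /* errors would fall off    */
            int val = pixel[y * width + x] + err[y * width + x];
            int diff;
            if (val > THRESHOLD) { diff = val - THRESHOLD; dest[y * width + x] = 0xff; }
            else                 { diff = val;             dest[y * width + x] = 0x00; }
            err[ y      * width + x + 1] += (diff * 7) / 16;
            err[(y + 1) * width + x - 1] += (diff * 3) / 16;
            err[(y + 1) * width + x    ] += (diff * 5) / 16;
            err[(y + 1) * width + x + 1] += (diff * 1) / 16;
        }
    free(err);
}
```

An all-white input always stays above the threshold and an all-black input never reaches it, which gives a quick way to check the two output branches.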

 

So I hope you can see the problem here.  We can’t simply assign each pixel’s calculation to an independent work item because we cannot guarantee the order that work items will run.  In OpenCL the order of execution of work items – even the order of execution of work groups – is entirely undefined.  As we progress left to right across the image, and then line by line down the image, each pixel is dependent on the output of 4 previous pixel calculations.


 

Embarrassingly Serial?

 

So is there any hope for parallelization here?  On its own perhaps this algorithm is better tackled by the CPU.  But imagine the Floyd-Steinberg filter was part of a filter chain, where there was significant benefit from running the other filters before and after this one on a GPU like the ARM® Mali-T604.

 

kernels.png

Any move from GPU to CPU will require cache synchronisation, introducing a level of overhead

 

Here we would need two CPU/GPU synchronization points either side of the Floyd-Steinberg filter.  These are potentially quite expensive.  Not only do the caches need to be flushed back to main memory, but the CPU needs to be ready to take on this job, which could complicate other jobs the CPU might be doing.  So if it was possible to get a GPU version running somehow, even if its processing time was a little slower than the CPU, there might still be some net benefit to the GPU implementation.

 

Let’s look at the algorithm again and see what we might be able to do.  We can see that the only thing stopping an individual pixel being processed is whether its related error buffer value has been written to by all four pixels: the one to the left, and the three above as follows.


pall.png

C2 depends on the results of four previous pixel calculations, B1, C1, D1 and B2


From the diagrams we can see that if we want to process pixel C2 we have to wait until B1, C1, D1 and B2 have finished, as these all write values into C2’s error buffer location.

 

If we have a work item per pixel, each work item would have to wait for this condition to be met, and work items representing pixels low down the image or far to the right could be left waiting a long time.  And if all the threads that can run at the same time are filled with waiting work items, you reach deadlock.  Nothing will proceed.  Not good.

 

What we need to do is to impose some order on all this… some good old-fashioned sequential programming alongside the parallelism.  By serializing parts of the algorithm we can reduce the number of checks a work item would need to do before it can determine it is safe to continue.  One way to do this is to assign an entire row to a single work item.  That way we can ensure we process the pixels in a line from left to right.  The work item processing the row of pixels below then only needs to check the progress of this work item: as long as it is two pixels ahead then it is safe to proceed with the next pixel.  So we would have threads progressing across their row of pixels in staggered form:

 

threads.png

Each thread processes a horizontal line of pixels and needs to be staggered as shown here


Of course there are a few wrinkles here.  First of all we need to consider workgroups.  Each workgroup – a collection of work items, with each work item processing a line – needs to be run in order.  So the work items in the first workgroup need to process the top set of lines, the next workgroup the set of lines below that, and so on.  But there’s no guarantee that workgroups are submitted to the GPU in sequential order, so simply using the OpenCL function get_group_id – which returns the numerical offset of the current workgroup – won’t do as a way of determining which set of lines to process.  Instead we can use OpenCL atomics: if the first work item in each workgroup atomically increments a global value, and that value is then used to determine which group of lines the workgroup processes, we can guarantee the lines are processed in order as they progress down the image.
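The effect of atomic_inc can be sketched on the host with C11 atomics and POSIX threads (all names here are hypothetical, not from the kernel): however the scheduler orders the threads, each obtains a unique, contiguous index, which is exactly the property needed to hand each workgroup its own band of lines.

```c
#include <pthread.h>
#include <stdatomic.h>

#define NTHREADS 8

static atomic_int rider = 0;    /* plays the role of workgroup_rider */
static int claimed[NTHREADS];   /* which band indices were handed out */

static void *grab_band(void *arg)
{
    (void)arg;
    int band = atomic_fetch_add(&rider, 1);  /* unique index, like atomic_inc() */
    claimed[band] = 1;                       /* each band claimed exactly once */
    return NULL;
}

/* Returns 1 if every band 0..NTHREADS-1 was claimed exactly once. */
int all_bands_claimed(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, grab_band, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    for (int i = 0; i < NTHREADS; i++)
        if (!claimed[i])
            return 0;
    return 1;
}
```

Note that the indices are dense and start at zero regardless of which thread runs first; that is what lets a band index be computed directly from the fetched value.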

 

Here’s a diagram showing how workgroups would share the load within an image:

 

wg.png

Each workgroup processes a horizontal band of pixels.  In this case the workgroup size is 128, so the band height is 128 pixels, with each work item (thread) processing a single row of pixels.

 

So for each line we need a progress marker so that the line below knows which pixel it is safe to calculate up to.  A work item can then sit and wait for the line above if it needs to, ensuring no pixel proceeds until the error distribution values it needs have been written out.
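To make the scheme concrete, here is a host-side C simulation of the same idea, using one POSIX thread per row and C11 atomics as the progress markers (an illustrative sketch with my own names, not the OpenCL kernel itself). Each row busy-waits until the row above is at least two pixels ahead, processes its next pixel, then publishes its own progress.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define W 32
#define H 8
#define THRESHOLD 128

static uint8_t src_img[H][W], dst_img[H][W];
static int err_buf[H + 1][W + 2];   /* one-column border keeps x-1/x+1 writes in bounds */
static atomic_int row_progress[H];  /* pixels completed by each row's thread */

static void do_pixel(int y, int x)
{
    int sum = src_img[y][x] + err_buf[y][x + 1];
    int e;
    if (sum > THRESHOLD) { e = sum - THRESHOLD; dst_img[y][x] = 0xff; }
    else                 { e = sum;             dst_img[y][x] = 0x00; }
    err_buf[y][x + 2]     += (e * 7) / 16;
    err_buf[y + 1][x]     += (e * 3) / 16;
    err_buf[y + 1][x + 1] += (e * 5) / 16;
    err_buf[y + 1][x + 2] += (e * 1) / 16;
}

static void *row_worker(void *arg)
{
    int y = (int)(intptr_t)arg;
    for (int x = 0; x < W; x++) {
        if (y > 0)   /* stay two pixels behind the row above, as in the text */
            while (atomic_load(&row_progress[y - 1]) < x + 2)
                ;    /* busy-wait on the progress marker */
        do_pixel(y, x);
        atomic_store(&row_progress[y], x + 1);
    }
    atomic_store(&row_progress[y], W + 2);  /* let the row below run to its end */
    return NULL;
}

/* Runs the threaded version and a serial reference; returns 1 if they match. */
int staggered_rows_match_serial(void)
{
    uint8_t ref[H][W];
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            src_img[y][x] = (uint8_t)(x * 8 + y);

    /* Serial reference pass. */
    memset(err_buf, 0, sizeof err_buf);
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            do_pixel(y, x);
    memcpy(ref, dst_img, sizeof ref);

    /* Threaded pass with progress markers. */
    memset(err_buf, 0, sizeof err_buf);
    memset(dst_img, 0, sizeof dst_img);
    for (int y = 0; y < H; y++)
        atomic_store(&row_progress[y], 0);
    pthread_t t[H];
    for (int y = 0; y < H; y++)
        pthread_create(&t[y], NULL, row_worker, (void *)(intptr_t)y);
    for (int y = 0; y < H; y++)
        pthread_join(t[y], NULL);

    return memcmp(ref, dst_img, sizeof ref) == 0;
}
```

Because the two-pixel stagger preserves every dependency, the threaded result is bit-identical to the serial one; the sequentially consistent atomic loads and stores also give the happens-before ordering needed for the plain err_buf writes to be visible to the row below.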

 

Here’s the rough pseudo code for what the kernel needs to do…

 

is this the first work item in the workgroup?
{
    atomic increment the global workgroup rider
    initialize to zero the local progress markers
}

barrier        // All work items in this workgroup wait until this point is reached

from the global workgroup rider and the local work item id,
determine the line in the image we’re processing

loop through the pixels in the line we’re processing
{
    wait for the work item representing the line above to
    have completed enough pixels so we can proceed

    do the Floyd-Steinberg calculation for this pixel

    update the progress marker for this line
}


You may have spotted the next wrinkle in this solution.  The local progress markers are fine for ensuring that individual lines don’t get ahead of themselves – with the exception of the first work item (representing the top line in the group of lines represented by this workgroup).  This first line needs to only progress once the last line of the previous workgroup has got far enough along.  So we need a way of holding markers for the last line of each workgroup as well.  The wait for the first work item then becomes a special case, as does the update of the marker for the last line.

 

Here’s the initialisation part of the kernel code:

 

__kernel void fs2(__global uchar          *src,                 // The source greyscale image buffer
                  __global uchar          *dst,                 // The destination buffer
                  __global uchar          *err_buf,             // The distribution of errors buffer
                  __global uint           *workgroup_rider,     // A single rider used to create a unique workgroup index
                  __global volatile uint  *workgroup_progress,  // A buffer of progress markers for each workgroup
                  __local volatile uint   *progress,            // The local buffer for each workgroup
                  uint                     width)               // The width of the image
{
    __local volatile uint workgroup_number;

    /* We need to put the workgroups in some order. This is done by
       the first work item in the workgroup atomically incrementing
       the global workgroup rider. The local progress buffer - used
       by the work items in this workgroup - also needs initialising...
    */
    if (get_local_id(0) == 0)                // A job for the first work item...
    {
        // Get the global order for this workgroup...
        workgroup_number = atomic_inc(workgroup_rider);

        // Initialise the local progress markers...
        for (int i = 0; i < get_local_size(0); i++)
            progress[i] = 0;
    }

    barrier(CLK_LOCAL_MEM_FENCE);            // Wait here so we know progress buffer and
                                             // workgroup_number have been initialised

 

Note the use of the 'volatile' keyword when defining some of the variables here.  This hints to the compiler that these values can be changed by other threads, thereby avoiding certain optimisations that might otherwise be made.

 

The barrier in the code is also something to highlight.  There are often better ways than using barriers, typically using some kind of custom semaphore system.  The barrier here however is only used as part of the initialization of the kernel, and is not used subsequently within the loop.  Even so, I implemented a version that used a flag for each workgroup, setting the flag once the initialization has been done during the first work item’s setup phase, and then sitting and checking for the flag to be set for each of the other work items.  It was a useful exercise, but didn’t show any noticeable difference in performance.

 

With initialization done, it’s time to set up the loop that will traverse across the line of pixels:


    /* The area of the image we work on depends on the workgroup_number determined
       earlier. We multiply this by the workgroup size and add the local id index.
       This gives us the y value for the row this work item needs to calculate.
       Normally we would expect to use get_global_id to determine this, but can't here.
    */
    int y = (workgroup_number * get_local_size(0)) + get_local_id(0);
    int err;
    int sum;

    for (int x = 1; x < (width - 1); x++)    // Each work item processes a line
                                             // (ignoring 1st and last pixels)...
    {
        /* Need to ensure that the data in err_buf required by this
           work item is ready. To do that we need to check the progress
           marker for the line just above us. For the first work item in this
           workgroup, we get this from the global workgroup_progress buffer.
           For other work items we can peek into the progress buffer local
           to this workgroup.

           In each case we need to know that the previous line has reached
           2 pixels on from our own current position...
        */
        if (get_local_id(0) > 0)    // For all work items other than the first in this workgroup...
        {
            while (progress[get_local_id(0) - 1] < (x + 2));
        }
        else                        // For the first work item in this workgroup...
        {
            if (workgroup_number > 0)
                while (workgroup_progress[workgroup_number - 1] < (x + 2));
        }


At the top of each loop we need to ensure the line above has got far enough ahead of where this line is.  So the first item in the work group checks on the progress of the last line in the previous workgroup, whilst the other items check on the progress of the line above.

 

After that, we’re finally ready to do the Floyd-Steinberg calculation for the current pixel:

 

        sum = src[(width * y) + x] + err_buf[(width * y) + x];

        if (sum > THRESHOLD)
        {
            err                  = sum - THRESHOLD;
            dst[(width * y) + x] = 0xff;
        }
        else
        {
            err                  = sum;
            dst[(width * y) + x] = 0x00;
        }

        // Distribute the error values...
        err_buf[(width * y)       + x + 1] += (err * 7) / 16;
        err_buf[(width * (y + 1)) + x - 1] += (err * 3) / 16;
        err_buf[(width * (y + 1)) + x    ] += (err * 5) / 16;
        err_buf[(width * (y + 1)) + x + 1] += (err * 1) / 16;


The final thing to do within the main loop is to set the progress markers to reflect that this pixel is done:

 

        /* Set the progress marker for this line...
           If this work item is the last in the workgroup we set
           the global marker so the first item in the next
           workgroup will pick this up.
           For all other work items we set the local progress marker.
        */
        if (get_local_id(0) == (get_local_size(0) - 1))    // Last work item in this workgroup?
            workgroup_progress[workgroup_number] = x;
        else
            progress[get_local_id(0)]            = x;
    }

 

There’s one more thing to do.  We need to set the progress markers to just beyond the width of the image so subsequent lines can complete:


    /* Although this work item has now finished, subsequent lines
       need to be able to continue to their ends. So the relevant
       progress markers need bumping up...
    */
    if (get_local_id(0) == (get_local_size(0) - 1))    // Last work item in this workgroup?
        workgroup_progress[workgroup_number] = width + 2;
    else
        progress[get_local_id(0)]            = width + 2;
}

 

 

 

A Word about Warp


Before I talk about performance – and risk getting too carried away – it’s worth considering again the following line:

 

while (progress[get_local_id(0) - 1] < (x + 2));

 

This loop keeps a work item waiting until a progress marker is updated, ensuring the processing for this particular line doesn’t proceed until it’s safe to do so.  The progress marker is updated by the thread processing the line above.  Other than the use of barriers, inter-thread communication is not specifically ruled out in the specification for OpenCL 1.1 or 1.2.  But neither is it specifically advocated.  In other words, it is a bit of a grey area.  As such, there is a risk that behaviour might vary across different platforms.

 

Take wavefront (or “warp”) based GPUs for example.  In wavefront architectures, threads (work items) are clustered into small groups, each sharing a program counter.  Threads within such a group cannot diverge: they can go dormant whilst other threads follow a different conditional path, but ultimately they remain in lock-step with each other.  This has advantages for scalability, but it means the above line will deadlock on such hardware: if a thread spins waiting for a progress marker set by another thread in the same warp, neither can ever make progress.

 

The Mali-T600, -T700 and -T800 series of GPUs are not wavefront based.  With each thread having its own program counter, threads are entirely independent of each other, so the above technique runs fine.  But it should be easy enough to accommodate wavefront architectures by replacing the 'while' loop with a conditional that determines whether the thread can continue:

 

Current method for the main loop:


for (x = 1; x < (width - 1); x++)

{

    wait for the line above to be >= 2 pixels ahead

 

    process pixel x

    update progress for this line

}


Alternative method supporting wavefront-based architectures:

for (x = 1; x < (width - 1); )

{

    if line above is >= 2 pixels ahead

    {

          process pixel x

          update progress for this line
          x++

    }

}


This alternative version allows the loop to iterate regardless of whether the previous line is ready or not.  Note that in this version, x now only increments if the pixel is processed.

 

It’s also worth mentioning that as all the work items in the same wavefront are in lock-step by design, once the work items have been started further checking between the threads would be unnecessary.  It might be feasible to optimise a kernel for a wavefront-based GPU to take advantage of this.


 

How did it do?


Technically, the above worked, producing an identical result to the CPU reference implementation.  But what about performance?  The OpenCL implementation ran between 3 and 5 times faster than the CPU implementation.  So there is a really useful uplift from the GPU version. It would also be possible to create a multithreaded version on the CPU, and this would potentially provide some improvement.  But remember that if this filter stage was part of a chain running on the GPU, with the above solution we can now slot this right in amongst the others, further freeing the CPU and removing those pesky sync points.

 

kernels+2.png

Moving the CPU kernel to the GPU will remove the need for cache synchronization in the above example


And what about further room for improvement?  There are all sorts of optimisation techniques we would normally advocate, and those steps have not been explored in detail here.  But just for example, the byte read and writes could be optimised to load and store several bytes in one operation.  There are links at the bottom of this post to some articles and other resources which go into these in more detail.  With a cursory glance however it doesn’t appear that many of the usual optimisation suspects would apply easily here… but nevertheless I would be fascinated if any of you out there can see some interesting ways to speed this up further.  In the meantime it is certainly encouraging to see the improvement in speed which the Mali GPU brings to the algorithm.


 

Platform used for case study

 

CPU: ARM Cortex®-A15 running at 1.7GHz

GPU: ARM Mali-T604 MP4 running at 544MHz



Further resources

 

For more information about optimising compute on Mali GPUs, please see the various tutorials and documents listed here:

GPU Compute, OpenCL and RenderScript Tutorials - Mali Developer Center

 

cc.png

This work by ARM is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. However, in respect of the code snippets included in the work, ARM further grants to you a non-exclusive, non-transferable, limited license under ARM’s copyrights to Share and Adapt the code snippets for any lawful purpose (including use in projects with a commercial purpose), subject in each case also to the general terms of use on this site. No patent or trademark rights are granted in respect of the work (including the code snippets).

MediaTek scales the mobile market with Mali™-T720


We’ve all come to expect our portable gadgets to wow us with their smooth, 3D, feature-rich displays and their ability to support fast-action, console-quality games. But I wonder how many people stop to think about the innovation and technology that underpins those stunning user experiences.

Phone 3.png

Innovation

Here at ARM we’re all about innovation – it’s what we do! The resulting technologies are then licensed to the world’s leading semiconductor companies who make the chips for all those exciting products that we have come to love. One of ARM’s leading technology brands is Mali – and ARM® Mali™ GPUs are at the heart of a wide range of successful consumer devices – providing that ‘wow factor’ we talked about at the beginning.

 

But let’s just step back for a moment and consider how the ‘innovate, develop, implement’ cycle can be sustained at such a fast pace that each product generation has features even more compelling than the last. The key principle here is to blend the introduction of new technologies and features with the smart reuse of existing hardware and software platforms - in order to leverage the maximum return from every investment and to ensure rapid time to market (TTM).

 

Scalability

Mali-T720.png

The scalability of Mali GPUs perfectly aligns with the reuse paradigm; the performance of a design can be tuned by simply varying the number of cores within the GPU. This, combined with reuse of the same driver and software framework, means a wide range of products - from entry-level, cost-sensitive designs through to those that are high-end and feature-rich - can quickly be brought to market.


MediaTek

One of ARM’s SoC partners is MediaTek, a company based in Taiwan. MediaTek always impresses me with the speed at which it innovates and brings product to market. I’m pleased to say that MediaTek is a partner for Mali GPU technology across its product range. A pair of recent announcements highlight how MediaTek has used the scalability of Mali GPUs and ARM Cortex® processors to good effect. In October last year, MediaTek announced details of their MT6735, an SoC for the mainstream that uses a four-core Cortex-A53 processor and a Mali-T720 MP2 GPU. In the past few days MediaTek has followed this with the announcement of the MT6753. This latest SoC is aimed at high-end applications and uses an eight-core Cortex-A53 processor combined with a Mali-T720 MP3 GPU.

 

The use of common processor and GPU types across the product range allows the TTM benefits of scalability to be realized; MediaTek comments, ‘MT6753 is compatible with the previously released MT6735, which can significantly shorten the product development cycle’. According to Mr Hsieh, president of MediaTek, there will also be a variant of the MT6735 to address particular low-end market requirements – the MT6735M.

 

These news pieces from MediaTek are yet another indication that 2015 is going to be an exciting year for Mali GPUs and Cortex processors.


Running OpenCL on Chromebook remotely


If you have followed my instructions on installing OpenCL on the Samsung Chromebook or on the Samsung Chromebook 2, you may be wondering what's next. Well, optimising your code for the ARM® Mali GPUs, of course! If you are serious about using your Chromebook as a development board, you may want to know how to connect to it remotely via ssh, and use it with the lid closed. In this blog post, I'll explain how. All the previous disclaimers still apply.

 

Enabling remote access to Chromebook

I assume your Chromebook is already in the developer mode (and on the dev-channel if you are really brave).

Making the root file system writable

Open the Chrome browser, press Ctrl-Alt-T and carry on to enter the shell:

Welcome to crosh, the Chrome OS developer shell.

If you got here by mistake, don't panic!  Just close this tab and carry on.

Type 'help' for a list of commands.

crosh> shell
chronos@localhost / $

Using sudo, run make_dev_ssd.sh with the --remove_rootfs_verification flag:

chronos@localhost / $ sudo /usr/share/vboot/bin/make_dev_ssd.sh --remove_rootfs_verification
ERROR: YOU ARE TRYING TO MODIFY THE LIVE SYSTEM IMAGE /dev/mmcblk0.
The system may become unusable after that change, especially when you have
some auto updates in progress. To make it safer, we suggest you to only
change the partition you have booted with. To do that, re-execute this command
as:
  sudo ./make_dev_ssd.sh --remove_rootfs_verification --partitions 4
If you are sure to modify other partition, please invoke the command again and
explicitly assign only one target partition for each time
(--partitions N )
ERROR: IMAGE /dev/mmcblk0 IS NOT MODIFIED.

Note the number after the --partitions flag and rerun the previous command with this number e.g.:

chronos@localhost / $ sudo /usr/share/vboot/bin/make_dev_ssd.sh --remove_rootfs_verification --partitions 4
Kernel B: Disabled rootfs verification.
Backup of Kernel B is stored in: /mnt/stateful_partition/backups/kernel_B_20150221_224038.bin
Kernel B: Re-signed with developer keys successfully.
Successfully re-signed 1 of 1 kernel(s)  on device /dev/mmcblk0.

Finally, reboot:

chronos@localhost / $ sudo reboot

 

Creating host keys

Create keys for sshd to use:

chronos@localhost / $ sudo ssh-keygen -t dsa -f /mnt/stateful_partition/etc/ssh/ssh_host_dsa_key
Generating public/private dsa key pair.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /mnt/stateful_partition/etc/ssh/ssh_host_dsa_key.
Your public key has been saved in /mnt/stateful_partition/etc/ssh/ssh_host_dsa_key.pub.
chronos@localhost / $ sudo ssh-keygen -t rsa -f /mnt/stateful_partition/etc/ssh/ssh_host_rsa_key
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /mnt/stateful_partition/etc/ssh/ssh_host_rsa_key.
Your public key has been saved in /mnt/stateful_partition/etc/ssh/ssh_host_rsa_key.pub.

You can leave the passphrase empty (hit the Enter key twice).

 

Enabling password authentication

Change the PasswordAuthentication setting in /etc/ssh/sshd_config to 'yes':

chronos@localhost / $ sudo vim /etc/ssh/sshd_config
# Force protocol v2 only
Protocol 2

# /etc is read-only.  Fetch keys from stateful partition
# Not using v1, so no v1 key
HostKey /mnt/stateful_partition/etc/ssh/ssh_host_rsa_key
HostKey /mnt/stateful_partition/etc/ssh/ssh_host_dsa_key

PasswordAuthentication yes
UsePAM yes
PrintMotd no
PrintLastLog no
UseDns no
Subsystem sftp internal-sftp

 

Starting sshd

Allow inbound ssh traffic via port 22 and start sshd:

chronos@localhost / $ sudo /sbin/iptables -A INPUT -p tcp --dport 22 -j ACCEPT
chronos@localhost / $ sudo /usr/sbin/sshd

Change the root password (no, I'm not showing you mine):

chronos@localhost / $ sudo passwd
Enter new UNIX password: 
Retype new UNIX password: 
passwd: password updated successfully

 

Connecting from another computer

Check the IP address of your Chromebook:

chronos@localhost / $ ifconfig
lo: flags=73  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10
        loop  txqueuelen 0  (Local Loopback)
        RX packets 72  bytes 5212 (5.0 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 72  bytes 5212 (5.0 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

mlan0: flags=4163  mtu 1500
        inet 192.168.1.70  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 fe80::26f5:aaff:fe26:ee0a  prefixlen 64  scopeid 0x20
        ether 24:f5:aa:26:ee:0a  txqueuelen 1000  (Ethernet)
        RX packets 10522  bytes 3356427 (3.2 MiB)
        RX errors 0  dropped 8  overruns 0  frame 0
        TX packets 6516  bytes 1956509 (1.8 MiB)
        TX errors 3  dropped 0  overruns 0  carrier 0  collisions 0

(In this case, the IP address is 192.168.1.70.)

You should now be able to connect from another computer e.g.:

[lucy@theskyofdiamonds] ssh root@192.168.1.70
localhost ~ # whoami
root

 

Making sshd start on system startup

To make sshd start on system startup, add a script to /etc/init e.g.

chronos@localhost / $ sudo vim /etc/init/sshd.conf
start on started system-services
script
  /sbin/iptables -A INPUT -p tcp --dport 22 -j ACCEPT
  /usr/sbin/sshd
end script

(A two-space indent is sufficient for the script block.)

 

Enabling passwordless connection from another computer

Generate a public/private key pair from another computer e.g.:

[lucy@theskyofdiamonds] ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.

Copy the public key to your Chromebook e.g.:

[lucy@theskyofdiamonds] ssh-copy-id root@192.168.1.70
The authenticity of host '192.168.1.70 (192.168.1.70)' can't be established.
RSA key fingerprint is 58:2d:89:e7:52:5c:b4:85:1e:79:e0:23:e8:36:f0:c2.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
Password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'root@192.168.1.70'"

and check to make sure that only the key(s) you wanted were added.

[lucy@theskyofdiamonds] ssh root@192.168.1.70
Last login: Sun Feb 22 00:08:45 GMT 2015 from 192.168.1.74 on ssh
localhost ~ #

 

Keeping your Chromebook awake with the lid closed

With the lid open, your Chromebook's GPU is rendering the desktop at the same time as running your compute tasks. This may create undesired noise when obtaining performance counters. To keep the Chromebook awake when you close the lid, connect to the Chromebook and disable power management:

[lucy@theskyofdiamonds] ssh root@192.168.1.70
Last login: Sun Feb 22 00:48:41 GMT 2015 from 192.168.1.74 on ssh
localhost ~ # stop powerd

Check that when you close the lid, you can still "talk" to the Chromebook e.g. launch tasks.

To enable power management again, enter:

localhost ~ # start powerd

ARM at GDC 2015: A One-Stop-Shop for Mobile Game Developers



Just as September marks the turn of the year for schoolchildren, and April the turn of the year for taxes, so too does the Game Developers Conference mark the climax of the year for anyone in the gaming industry. ARM is no different: the work of our ecosystem team begins and ends in March, with demos finalized, developer guides written and tools released, all in time for the show. With GDC coinciding with MWC in Barcelona this year, mobile game developers can definitely expect a week full of exciting announcements.

 

In the field of mobile game development, ARM recognizes the challenges. While mobile devices have the biggest reach of any gaming platform, the thermal and battery constraints have not traditionally made them a straightforward target for visually stunning games. However, increasingly advanced processors and energy efficient technologies are hitting the market each year and with IP such as the ARM® Mali™-T880 GPU and ARM Cortex®-A72 processors in the pipeline, designed specifically to deliver high-end gaming, tomorrow's premium mobile experiences are being redefined.

 

This year we have a stunning lineup of new demos that form a one-stop-shop for cutting-edge mobile development techniques, all based on the latest hardware. If you’re starting to work with APIs such as OpenGL® ES 3.1 or WebGL, come and find out how to use compute shaders for occlusion culling, or how WebGL games can rival the visual quality of those built in OpenGL ES. For those working with popular game engines such as Unity or Unreal, we have brand new demos featuring battery-saving techniques such as Pixel Local Storage and ASTC as well as tips for driving up visual quality in mobile games using Enlighten’s global illumination solution, reflections, refractions and shadows.  64-bit mobile gaming is now present in leading engines and we will be showcasing the performance improvements available both on the booth and in our sponsored sessions.

 

This week we announced updates to three of our most popular Mali graphics tools, including a plug-in to Unreal Engine 4 for the Offline Shader Compiler. The Offline Shader Compiler allows you to analyze your materials and get advanced mobile statistics, previewing the number of arithmetic, load & store and texture instructions in your code. The OpenGL ES Emulator gains support for geometry and tessellation shaders, enabling users to start developing for the Android Extension Pack (AEP) as well as OpenGL ES 2.0, 3.0 and 3.1. The Mali Graphics Debugger has gained support for 64-bit Android, improved live shader editing and can now trace the Android Extension Pack (AEP). The upgrades to the Emulator and the Debugger are available for download now; the Offline Shader Compiler plug-in is being previewed at GDC.

 

Joining us on the ARM booth will be partners who share our ambition to make the production of high-quality mobile games as easy as possible. Cocos2d-x, who recently announced the integration of ARM’s DS-5 Streamline into the Cocos Code IDE to let developers optimize their games with ease, will be sharing their extremely popular engine with attendees. Tencent, the world-leading free-to-play publisher and #1 brand in China, will join the ARM booth with their innovative titles for mobile. Simplygon’s automatic 3D asset optimization middleware is ideal for increasing the performance of your mobile game. For those facing the challenge of smartphone market diversity, Testin’s quality assurance testing suite is a blessing for confirming the performance of your application across a variety of devices. PlayCanvas’ ever-popular WebGL game engine (free, open source and backed by amazing developer tools) will be showing a new demo featuring some well-known ARM characters!

 

All of these demonstrations will be accompanied by live sessions and in-depth explanations by the engineers who developed them on the in-booth ARM lecture theatre. The full schedule and more information about ARM at GDC is available at Mali Developer Center. We look forward to seeing you on the ARM booth #1624 next week!

Supporting the development of mobile games at GDC 2015


Over 25,000 game developers travel to San Francisco each year for the Game Developers Conference (GDC) to see and hear about the latest features and capabilities of game engines, games middleware, developer tools and hardware platforms.


At today’s Google Developers day it was announced that a whopping $7 billion in revenue has been paid out to their app developers. And this is set to continue growing, thanks to lower cost smartphones and tablets exposing millions of people, of all ages and from all walks of life, to video games for the first time.


From a technical perspective, we are seeing, on average, a 30 to 50 per cent increase in the performance of mobile devices each year. The computational power of mobile GPUs is already largely on par with that of the Xbox 360 and PlayStation 3. Challenges remain, such as the availability of memory bandwidth, but ARM is developing techniques to overcome these, which developers can access via the sample code, tutorials, tools and developer guides available at our developer portal; our latest demos of these techniques will be shown and explained during our GDC talks.


The GDC developer audience has extremely diverse educational needs: from the game artist creating game assets, visual environments and characters, to developers using a specific game engine or middleware, to developers designing their own engine or using none at all. We have therefore shaped our developer tutorials and resources to fit this diversity.


At GDC 2015, we open our talk sessions with “Unreal Engine 4: Mobile Graphics on ARM CPU and GPU Architecture”, showing, first of all, how Epic Games’ engine has been ported to the latest ARMv8 architecture, and showcasing the results with a bespoke game demo from Epic Games called Moon Temple.

For game developers, the ARMv8 architecture mainly translates to porting their game to a 64-bit OS; the latest Android “L” already includes 64-bit support, and Apple has also mandated 64-bit support for all new iOS 8 apps. The session continues with the tile-based ARM® Mali™ GPU architecture, showing how to reduce external memory bandwidth by keeping memory transactions localized to fast on-chip memory. The light bloom effect of the Moon Temple demo is built using that technique, implemented via the Khronos OpenGL® ES extension “Shader Pixel Local Storage”. Other sample code using this extension is also available here. Another highlight of the talk is the ASTC integration into Unreal Engine 4. ASTC is a texture compression standard developed by ARM and adopted by Khronos. It allows a free choice of multiple bit rates across all supported input texture formats, from LDR to HDR, as well as the ability to compress 3D textures. Our developer portal has sample code and further tutorials on it.


Furthermore, the Enlighten middleware by Geomerics, an ARM company, enables dynamic global illumination and is also available pre-integrated into Unreal Engine. A full session is dedicated to it, revealing the latest features and advances in Enlighten and the collaboration with Unreal Engine and Unity.


Another hot topic for developers is learning how best to use the latest API features, and for mobile and embedded devices OpenGL ES is the 3D graphics API of choice. Our talk “Unleash the Benefits of OpenGL ES 3.1 and Android Extension Pack (AEP)” focuses on the main new highlight, compute shaders, which allow the GPU to be used for general-purpose computing. Previously, developers had to learn a different API (such as OpenCL™) if they wanted to use GPU Compute. The session covers compute shader techniques and the best coding practices on Mali Midgard GPUs, showcasing a few of the code samples already available at our developer portal. The other highlight of the talk is the Android Extension Pack (AEP) and its best coding practices. AEP requires OpenGL ES 3.1 and is an optional feature in the latest Android “L” OS release; it enables around 20 other extensions, including tessellation, geometry shaders and ASTC.


Tools are key for developers, enabling them to debug and profile their code and find performance bottlenecks so they can optimize their application. The talk “How to Optimize your Mobile Game with ARM Tools and Practical Examples” presents the Mali Graphics Debugger (MGD) and DS-5 Streamline, with further live sessions at our ARM booth lecture theatre. MGD traces all the API calls that the graphics application makes; in particular it supports OpenGL ES 2.0, 3.x and EGL. The tool is complementary to DS-5 Streamline, which gives a system-wide view of the performance of the application. MGD v2.1 has just been launched and will be showcased at GDC 2015; its key features include support for Android 64-bit targets and the ability to trace Android Extension Pack functions.


Last but not least, there is a talk session aimed at Unity developers: “Enhancing your Unity Mobile Games”. Unity is the most widely used game engine; from our surveys at developer events and on the Mali Developer Center, we understand that up to 50% of game developers use Unity. The session is given jointly with Unity and RealtimeUK, the company that created the 3D assets for the brand new Ice Cave demo, premiering this week on the ARM booth. Developers will learn the differences when developing for mobile, as well as the bottlenecks they might encounter and how to overcome them, with reference to the work done in our ARM Guide to Unity. It goes on to cover the use of the local cubemap technique for reflections and then, inspired by that technique, a new way of rendering dynamic soft shadows in real time, which is one of the key additions in the refresh of the ARM Guide to Unity to be released later this year.

Forging the Future of Video Games with Enlighten 3



“What made Leonardo’s paintings so revolutionary was his use of light and shadow, rather than lines, to define three-dimensional objects.” – The National Gallery (nationalgallery.org.uk)

 

Great artists use lighting to convey emotions and tell stories. This is true whatever the medium, be it paint, film, or the latest video game. For computer-generated imagery, an accurate simulation of how light interacts with materials is essential and this is where global illumination (GI) - how light bounces around a scene - can be used to deliver incredible visual realism.

 

The big challenge in computer graphics is performing dynamic global illumination in real time as it has traditionally been computationally intense. And this is exactly why dynamic GI is interesting to ARM in our mission to deploy efficient technology wherever computing happens.

 

In 2013, ARM acquired Geomerics and their Enlighten technology, the game industry’s most advanced dynamic lighting solution. Enlighten is incredibly scalable – from fully baked to totally dynamic lighting, from PC and console to mobile and from small rooms to large environments.

 

While Enlighten is and always will be optimized to scale and run on any hardware platform, ARM’s design teams benefit from understanding the type of processing required to deliver cutting edge games; this in turn influences and informs our processor roadmaps.

 

This week at GDC we launched Enlighten 3 with Forge. The innovation in Enlighten 3 ensures it remains at the cutting edge of lighting technology; it also includes a new lighting editor and workflow tool called Forge which makes it easier for artists and developers to take advantage of the incredible visual quality on offer in Enlighten.

 

You can find more details on Enlighten 3 and Forge on the Geomerics website.

 

Since taking over the running of Geomerics within ARM I have been staggered by the popularity of the technology. Whether it is 40,000 YouTube hits in a week for a demo video or standing room only in a series of customer meetings in Japan a couple of weeks ago, the developer mindshare we have with Enlighten is significant. When we released our Realistic Rendering demo in 2014, Epic Games founder and CEO Tim Sweeney said:

“This is gorgeous!  I remember having dreams about this kind of dynamic indirect lighting back when I was building the Unreal Engine 1 renderer!”

 

 

2015 looks set to be even more exciting than 2014 as we see Enlighten reach tens of thousands of developers via Unity 5.

 

Steven Spielberg once said,

“You shouldn't dream your film, you should make it!”

 

 


...maybe with Enlighten that should apply to your game as well.


ARM Collaboration with Popular Game Engine Providers


You have probably all seen the announcements at GDC 2015 (going on right now). The big engine players are battling it out for the attention of developers large and small. Epic Games are giving away Unreal Engine 4 for free, and Unity Technologies have released Unity 5 (whose Personal Edition contains all engine features). But it doesn’t stop there: Chukong Technologies continues to improve the widely adopted open source engine Cocos2d-x, and PlayCanvas are showcasing their stunning WebGL engine at GDC right now. ARM has been excited to sponsor their PLAYHACK March competition, with a Samsung Chromebook 2 (Exynos 5 Octa with an ARM® Mali™-T628 MP6 GPU) as the prize for the winning entry.

 

As all the major players continue to offer amazing engines with incredible features and continue to reduce the cost to developers, the barrier to entry reduces too. This is great news for all developers but especially for new aspiring indies keen to break into the industry.

 

As the ecosystem continues to expand and the audience of games becomes ever wider, ARM is incredibly excited to take on the challenge of ARMing (pardon the pun) developers with the tools and education they need to optimise their games for both performance and energy consumption. At GDC 2015 we’re proud to be, once again, working with key games industry players.

 


Moon Temple, a collaboration between ARM and Epic Games

 

We have been collaborating with Epic Games to port Unreal Engine 4 to 64-bit and add ASTC texture compression. This can be seen in our joint demo, Moon Temple, based on an existing demo from Epic. As well as 64-bit and ASTC, you’ll see Pixel Local Storage in action; this allows us to perform energy-efficient rendering by keeping data on-chip to reduce bandwidth consumption. Full details of this demo were explained in our joint talk with Epic, “Unreal Engine 4 Mobile Graphics on ARM CPU and GPU Architecture (Presented by ARM)”. Keep an eye on the GDC Video Vault if you missed it.



Ice Cave demo from ARM; created with Unity 5

 

Last year, ARM released the “ARM Guide to Unity: Enhancing Your Mobile Games”, a popular guide teaching beginner and intermediate developers how to optimise for Mali and implement cool effects. We’ll be updating the guide again this year, but as a special sneak peek at what is coming, check out the Ice Cave demo from our internal demo team. Created with the latest Unity 5 (now available), this demo makes use of real-time global illumination (employing Enlighten, the new lighting solution inside Unity 5) and adds some cool reflection, refraction and soft shadow effects, all to be detailed in the ARM Guide to Unity later this year! ARM continues to work ever more closely with Unity and we can’t wait to see what the future holds. Be sure to check out our GDC talk “Enhancing Your Unity Mobile Games (Presented by ARM)“, also to be available in the GDC Video Vault.

 

Cocos2d-x is a very popular open source engine and we’re very pleased to have been working closely with its creators, Chukong Technologies, for a long time now. ARM is especially proud to have helped optimise the engine for Mali and to have assisted with the transition as the mobile world moves to ARMv8 and 64-bit. To make great games, you need a great engine backed by useful, easy-to-use and powerful tools. Recently you may have seen the announcement of the integration of ARM DS-5 into the Cocos IDE. This will allow developers to efficiently debug their C++ code and continue to optimise for ARM and push the boundaries!

 


Seemore WebGL created with PlayCanvas

 

The barriers to entry for game developers are falling while the possibilities continue to widen. ARM enables partners to create high-performing, energy-efficient devices, and we’re seeing more and more content running across multiple platforms. Console-quality engines are already running on smartphones and tablets, and that trend continues with browser technology. We’ve been collaborating with PlayCanvas to showcase what is possible with HTML5 and WebGL on current-generation mobile devices. PlayCanvas have created a stunning rendition of our popular demo SeeMore. With impressive lighting and physically based rendering, all running on last year’s Samsung Galaxy Note 10.1 2014 Edition (Mali-T628 MP6), developers can now easily deploy high-quality visuals across browser-based apps and games.

 

At the time of writing there are only two days remaining of GDC 2015. If you’re lucky enough to be there, be sure to head over to ARM Booth 1624 and check out all the demos and technology on show!

Mali Performance 5: An Application's Performance Responsibilities


Previous blog in the series: Mali Performance 4: Principles of High Performance Rendering

 

Welcome to the next instalment of my blog series looking at graphics rendering performance on Mali GPUs using OpenGL ES. This time around I'll be looking at some of the important application-side optimizations which should be considered when developing a 3D application, before making any OpenGL ES calls at all. Future blogs will look in more detail at specific areas of usage for the OpenGL ES API. Note that the techniques outlined in this blog are not Mali specific, and should work well on any GPU.

 

With Great Power Comes Great Responsibility

The OpenGL ES API specifies a serial stream of drawing operations which are turned into hardware commands for the GPU to perform, with explicit state control over how those drawing operations are to be processed. This low level mode of operation gives the application a huge amount of control over how the GPU performs its rendering tasks, but also means that the device driver has very little knowledge about the whole scene that the application is trying to render. This lack of global knowledge means that the device driver cannot significantly restructure the command stream that it sends to the GPU, so there is a burden of responsibility on the application to send sensible rendering commands through the API in order to achieve maximum efficiency and high performance. The first rule of high performance rendering and optimization is "Do Less Work", and that needs to start in the application before any OpenGL ES API calls have happened.

 

Geometry Culling

All GPUs can perform culling, discarding primitives which are outside of the viewing frustum or which are facing away from the camera. This is a very low level cull which is applied primitive-by-primitive, and which can only be applied after the vertex shader has computed the clip-space coordinate for each vertex. If an entire object is outside of the frustum, this can be a tremendous waste of GPU processing time, memory bandwidth, and energy.

 

The most important optimization which a 3D application should therefore perform is early culling of objects which are not visible in the final render, skipping the OpenGL ES API calls completely for these objects. There are a number of methods which can be used here, with varying degrees of complexity, a few examples of which are outlined below.

 

Object Bounding Box

The simplest scheme is for the application to maintain a bounding box for each object, which has vertices at the min and max coordinate in each axis. The object-space to clip-space computation for 8 vertices is sufficiently light-weight that it can be computed in software on the CPU for each draw operation, and the box can be tested for intersection with the clip-space volume. Objects which fail this test can be dropped from the frame rendering.
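To make this concrete, here is a minimal Python sketch of that test; the helper names and the row-vector matrix convention are my own, not code from any demo. The 8 box corners are transformed to clip space, and the object is rejected only when every corner lies outside the same clip plane:

```python
# CPU-side bounding-box culling sketch (illustrative names and conventions).

def transform(m, v):
    """Multiply [x, y, z, 1] by a 4x4 row-major matrix (row-vector style)."""
    x, y, z = v
    return [x * m[0][i] + y * m[1][i] + z * m[2][i] + m[3][i] for i in range(4)]

def box_corners(mins, maxs):
    """The 8 corners of an axis-aligned bounding box."""
    return [(x, y, z) for x in (mins[0], maxs[0])
                      for y in (mins[1], maxs[1])
                      for z in (mins[2], maxs[2])]

def box_visible(mvp, mins, maxs):
    """Conservatively keep the box unless all 8 clip-space corners fall
    outside the same clip plane (-w <= x, y, z <= w)."""
    corners = [transform(mvp, c) for c in box_corners(mins, maxs)]
    for axis in range(3):
        if all(c[axis] < -c[3] for c in corners):
            return False        # whole box beyond one plane: cull it
        if all(c[axis] > c[3] for c in corners):
            return False
    return True
```

With an identity MVP matrix (so the clip volume is -1 to 1), a unit box at the origin passes the test, while a box shifted to x in [2, 3] is culled.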

 

For very geometrically complex objects that cover a large amount of screen space it can be useful to break the object up into smaller pieces, each with its own bounding box, allowing some sections of the object to be rejected whenever the current camera position allows it.


 

[Images: final 3D render (left) and bounding-box debug view (right)]

The images above show one of our old Mali tech demos, an interactive fly-through of an indoor science-fiction space station environment. The final 3D render is shown on the left; a content debug view, in which the bounding boxes of the various objects in the scene are highlighted in blue, is shown on the right.

 

Scene Hierarchy Bounding Boxes

This type of bounding box scheme can be taken further, and turned into a more complete scene data structure for the world being rendered. Bounding boxes could be constructed for each building in a world, and for each room in each building, for example. If a building is off-screen then it can be rejected quickly, based on a single bounding box check, instead of needing hundreds of such checks for all of the individual objects which that building contains. In this hierarchy the rooms are only tested if their parent building is visible, and renderable objects are only tested if their parent room is visible.
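A sketch of that hierarchy walk, with a hypothetical dictionary-based node layout and a visible() callback standing in for the bounding-box test, might look like this; children are only tested when their parent's box passes:

```python
# Illustrative scene-hierarchy cull: skip whole subtrees whose bounding
# box fails the visibility test. Node layout is an assumption for this sketch.

def cull(node, visible, out):
    """Depth-first walk collecting the names of visible drawable nodes."""
    if not visible(node):
        return                      # e.g. building rejected: rooms never tested
    if node.get("drawable"):
        out.append(node["name"])
    for child in node.get("children", []):
        cull(child, visible, out)
```

With a building containing two rooms, a room whose box fails the test is skipped without its contents ever being examined.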

 

Portal Visibility

In many game worlds simple bounding box checks against the viewing frustum will remove a lot of redundant work, but still leave a significant amount present. This is especially common in worlds consisting of interconnected rooms, as from many camera angles the view of the spatially adjacent rooms will be entirely blocked by a wall, floor, or ceiling.

 

The bounding box scheme can therefore be supplemented with pre-calculated visibility knowledge, allowing for more aggressive culling of objects in the scene. For example in the scene consisting of three rooms shown below, there is no way that any object inside Room C can be seen by the player standing in Room A, so the application can simply skip issuing OpenGL ES calls for all objects inside Room C until the player moves into Room B.

 

[Image: floor plan of three interconnected rooms, A, B and C]

This type of visibility culling is often factored into game designs by the level designers; games can achieve higher visual quality and frame rates if the level design keeps a consistently small number of rooms visible at any point in time. For this reason many games using indoor settings make heavy use of S and U shaped rooms and corridors as they guarantee no line of sight through that room if the doors are placed appropriately.

 

This scheme can be taken further, allowing us to cull even Room B in our test floor plan in some cases, by testing the coordinates of the portals - doors, windows, etc. - linking the current room and the adjacent rooms against the frustum. If no portal linking Room A and Room B is visible from the current camera angle, then we can also discard the rendering of Room B.

 

[Image: portal test showing Room B culled when no connecting portal is visible]
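The room-graph traversal behind this can be sketched as follows; the data layout and the portal_visible() callback (which would wrap a clip-space test on the portal's corners) are illustrative assumptions, not code from a real engine:

```python
# Hypothetical portal cull: starting from the player's room, a neighbouring
# room is visited only if a portal leading into it passes the frustum test.

def rooms_to_draw(start, portals, portal_visible):
    """portals maps a room to a list of (portal_id, neighbouring_room)."""
    drawn, stack = {start}, [start]
    while stack:
        room = stack.pop()
        for portal, neighbour in portals.get(room, []):
            if neighbour not in drawn and portal_visible(portal):
                drawn.add(neighbour)
                stack.append(neighbour)
    return drawn
```

For the three-room example above: if the portal between Room A and Room B is visible but the one between B and C is not, only Rooms A and B are drawn; if no portal out of Room A is visible, Room A alone is drawn.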

These types of broad-brush culling checks are very effective at reducing GPU workload, and are impossible for the GPU or GPU drivers to perform automatically - we simply don't have this level of knowledge of the scene being rendered - so it is critical that the application performs this type of early culling.

 

Face Culling

It should go without saying that in addition to not sending off-screen geometry to the GPU, the application should ensure that the render state for the objects which are visible is set efficiently. For culling purposes this means enabling back-face culling for opaque objects, allowing the GPU to discard the triangles facing away from the camera as early as possible.

 

Render Order

OpenGL ES provides a depth buffer which allows the application to send in geometry in any order, and the depth-test ensures that the correct objects end up in the final render. While throwing geometry at the GPU in any order is functional, it is measurably more efficient if the application draws objects using a front-to-back order, as this maximizes the effectiveness of the early depth and stencil test unit (see The Mali GPU: An Abstract Machine, Part 3 - The Shader Core for more information on early depth and stencil testing). If you render objects using a back-to-front order then there is a good chance that the GPU will have spent some cycles rendering some fragments, only to later overdraw them with a fragment which is closer to the camera, which is a waste of precious GPU cycles!

 

It is not a requirement that triangles are sorted perfectly, which would be very expensive in terms of CPU cycles; we are just aiming to get it mostly right.  Performing an object-based sort using the bounding boxes or even just using the object origin coordinate in world space is often sufficient here; anywhere where we get triangles slightly out of order will be tidied up by the full depth test in the GPU.


Remember that blended triangles need to be rendered back-to-front in order to get the correct blend results, so it is recommended that all opaque geometry is rendered first in a front-to-back order, and then blended triangles are drawn last.
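A minimal ordering pass combining both rules might look like this in Python (the object layout is hypothetical); the squared distance from the camera to each object's origin serves as the approximate sort key:

```python
# Sketch of a draw-order pass: opaque objects front-to-back to maximize
# early-ZS rejection, blended objects last and back-to-front for correctness.

def draw_order(objects, camera):
    def dist2(obj):
        """Squared distance from the camera to the object origin."""
        return sum((o - c) ** 2 for o, c in zip(obj["origin"], camera))
    opaque = sorted((o for o in objects if not o["blended"]), key=dist2)
    blended = sorted((o for o in objects if o["blended"]), key=dist2, reverse=True)
    return opaque + blended
```

Note this sorts per object, not per triangle, matching the "mostly right" guidance above.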

 

Using Server-Side Resources

OpenGL ES uses a client-server memory model; client-side memory refers to resources owned by the application and driver, server-side to resources owned by the GPU hardware. Transferring resources from the client to the server side is generally expensive:

 

  • The driver must allocate memory buffers to contain the data.
  • Data must be copied from the application buffer into the driver-owned memory buffer.
  • Memory must be made coherent with the GPU view of memory. On unified memory architectures this may imply cache maintenance, on card-based graphics architectures this may mean an entire DMA transfer from system memory into the dedicated graphics RAM.

 

For reasons dating back to the early OpenGL implementations - namely that geometry processing was performed on the CPU, and did not use the GPU hardware at all - OpenGL and OpenGL ES have a number of APIs which allow client-side buffers for geometry to be passed into the API for every draw operation.

 

  • glVertexAttribPointer allows the user to specify per-vertex attribute data.
  • glDrawElements allows the user to specify per-draw index data.

 

Using client-side buffers specified this way is very inefficient. In most cases the models used each frame do not change, so this simply forces the drivers to perform a huge amount of work allocating memory and transferring the data to the graphics server for no benefit. As a much more efficient alternative OpenGL ES allows the application to upload data for both vertex attribute and index information to server-side buffer objects, which can typically be done at level load time. The per-frame data traffic for each draw operation when using buffer objects is just a set of handles telling the GPU which of these buffer objects to use, which for obvious reasons is much more efficient.

 

The one exception to this rule is the use of Uniform Buffer Objects (UBOs), which are server-side storage for per-draw-call constants used by the shader programs. As uniform values are shared by every vertex and fragment thread in a draw call, it is important that they can be accessed by the shader core as efficiently as possible, so the device drivers will generally aggressively optimize how they are packaged in memory to maximize hardware access efficiency. It is preferable that small volumes of uniform data per draw call are set directly via the glUniform<x>() family of functions, instead of using server-side UBOs, as this gives the driver far more control over how the uniform data is passed to the GPU. Uniform Buffer Objects should still be used for large uniform arrays, such as long arrays of matrices used for skeletal animation in a vertex shader.

 

State Batching

OpenGL ES is a state-based API with a large number of state settings which can be configured for each drawing operation. In most non-trivial scenes there will be multiple render states in use, so the application must typically perform a number of state change operations to set up the configuration for each draw operation before the draw itself can be issued.

 

There are two useful goals to bear in mind when trying to get the best performance out of the GPU and minimizing CPU overhead of the drivers:

  • Most state changes are low cost, but not free, as the driver will have to perform error checking and set some state in an internal data structure.
  • GPU hardware is designed to handle relatively large batches of work, so each draw operation should be relatively large.

 

One of the most common forms of application optimization to improve both of these areas is draw call batching, where multiple objects using the same render state are pre-packaged into the same data buffers and as such can be rendered using a single draw operation. This reduces CPU load as we have fewer state changes and draw operations to package for the GPU, and gives the GPU bigger batches of work to process. My colleague Stacy Smith has an entire blog dedicated to effective batching here: Game Set and Batch. There is no hard-and-fast rule on how many draw calls a single frame should contain, as a lot depends on the system hardware capability and the desired frame rate, but in general we would recommend no more than a few hundred draw calls per frame.
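As a toy illustration of the idea (the draw and state representations here are hypothetical, not an engine's real data structures), draws can be sorted and grouped by a render-state key so that each state is configured once per group:

```python
# Batching sketch: draws sharing a render-state key (shader, textures,
# blend mode, ...) are grouped so each group costs one state setup.
from itertools import groupby

def batch(draws):
    """Sort by state key, then merge runs that share it.
    Returns a list of (state, [meshes]) groups."""
    draws = sorted(draws, key=lambda d: d["state"])
    return [(state, [d["mesh"] for d in group])
            for state, group in groupby(draws, key=lambda d: d["state"])]
```

Three draws using two states collapse into two groups, so two state setups instead of three.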

 

It should also be noted that there is sometimes a conflict between getting the best batching and removing the most redundant work via depth sorting and culling. Provided that the draw call count remains sensible and your system is not CPU limited, it is generally better to remove GPU workload via improved culling and front-to-back object render order.

 

Maintain The Pipeline

As discussed in one of my previous blogs, modern graphics APIs maintain an illusion of synchronous execution but are really deeply pipelined to maximize performance. The application must avoid using API calls which break this rendering pipeline, or performance will rapidly drop as the CPU blocks waiting for the GPU to finish, and the GPU goes idle waiting for the CPU to give it more work to do. See Mali Performance 1: Checking the Pipeline for more information on this topic - it was important enough to warrant an entire blog in its own right!

Summary

This blog has looked at some of the critical application-side optimizations and behaviours which must be considered in order to achieve a successful high-performance 3D render using Mali. In summary, the key things to remember are:

 

  • Only send objects to the GPU which have some chance of being visible.
  • Render opaque objects in a front-to-back render order.
  • Use server-side data resources stored in buffer objects, not client-side resources.
  • Batch render state to avoid unnecessary driver overhead.
  • Maintain the deep rendering pipeline without application triggered pipeline drains.

 

Tune in next time and I'll start looking at some of the more technical aspects of using the OpenGL ES API itself, and hopefully even manage to include some code samples!

 

TTFN,

Pete

 


Pete Harris is the lead performance engineer for the Mali OpenGL ES driver team at ARM. He enjoys spending his time working on a whiteboard and determining how to get the best out of combined hardware and software compute sub-systems. He spends his working days thinking about how to make the ARM Mali drivers even better.

From Hours to Milliseconds: Project Ice Cave


RealtimeUK is a creative CG studio that has spent the last 18 years forging a strong reputation within the games industry for the quality of our cinematics, animation and marketing imagery. Whilst our team often works with developers in creating in-engine assets, it is the quality of our trailer work for which we have become best known.  Trailers for games such as 'War Thunder', 'World of Tanks', 'Smite' and 'Total War' have become the benchmark for what can be achieved, in terms of visual fidelity, using a pre-rendered solution.  Our studio has always been very ambitious in this regard and has always tried to maintain its position as a leading creative production studio that delivers exceptional quality by pioneering cutting-edge solutions for the video games industry.

 

With this in mind, when ARM first approached us with the ‘Ice Cave Demo’ project, our team was very keen to explore the ways in which we could collaborate with one of the world’s leading developers of semiconductor IP. With their technical know-how in real-time mobile graphics technology and our strength in producing compelling visuals, it seemed like a match made in heaven.   It was a great opportunity to explore the ways in which we could collaboratively push the quality of the next generation of mobile graphics processors.

 

We initially treated this production the same as any other: what will be the most effective way of realizing this production and making it as visually compelling as possible? What struck us first about the piece was its ambition; even producing the ice cave using our existing pre-rendered pipeline would prove challenging enough. The brief specified the need for real-time global illumination, reflection, refraction and soft shadows. Add in the budgets of the ARM® Mali™-T760 GPU-based platform and it was clear that we would have to plan the project carefully.

 

Fortunately, we already had a head start, having recently completed 'SMITE: Battleground of the Gods', during which we had invested in exploring cutting-edge techniques for creating ice and snow.


Smite_T2_image_23_eshot.jpg

Given the ambitions we had in our own minds for the quality we wanted to achieve, the first steps we took were to highlight any technical concerns that stood in the way of us effectively realizing this production.  Having done this, we then made some creative decisions that would best enable us to create something extremely 'high end', allowing us to make the most of this collaboration and enabling us both to get the best out of the platform. This process helped us realise the challenge of working on mobile devices, where budgets are very small in comparison to other platforms. Armed with this information, we created a brief for our pre-production department to produce a piece of concept art that would allow us to get the maximum visual quality, respect the restricted performance budget of mobile devices and also help ARM showcase the features they wanted to demo. This was further refined in discussions with ARM.


Slide_03.jpg

Using this as a starting point, the team at RealtimeUK further refined the vision following useful technical meetings with the team at ARM.  The budgets for this production were dictated by the platform upon which the demo would run.  In this instance, the target was to show the demo running on one of ARM’s latest designs in its ARM Cortex®-A series: the powerful processor family that is used in most of the world’s mobiles and tablets.

 

Whilst its core specs were incredibly powerful, they were far from the kind of processing power that we were used to, where pre-rendered trailers could take up to two hours per frame to render.  Although initially daunting, the only real deviation from our usual pipeline was that all the assets had to be built more economically. Whereas all assets are ordinarily built with great integrity and technical precision as standard, it is usually not so much of an issue if a piece of geometry is built with excessive triangle counts for a pre-rendered production.  Of course, this is not the case where the geometry is intended for an in-engine production, where every tri and UV map counts.  With this in mind, our artists carefully followed the guidelines set by the team at ARM and modified the authoring pipeline so that the assets could easily be integrated into Unity.  Using Unity on this project helped speed up the project flow and was a new process for ARM, who had, up until this point, used their own proprietary engine to realise their technical demos.  In particular, Unity’s prefab feature was a big help for our collaboration, with Unity helping to refine our workflow and techniques. Having a large community of Unity users was helpful to both parties and made for a less complicated production, as it removed the need to learn a bespoke engine.

 

Once these factors had been taken into consideration, the asset creation itself was relatively straightforward, with our team generating the normal, diffuse, specular and any other maps to the usual RealtimeUK standards.  A separate dedicated team was used to create and rig the characters that would feature in the demo, as were the effects, which were created with a more artist-led approach.

 

Once all the assets were built, the team at RealtimeUK was able to render them to match the intended vision.  Using our own well-tested pipeline in this process, which replicated the final lighting rig, we were able to share a pre-rendered version of the assets as a movie with the team at ARM. The intention was that the ARM team would then be able to realise our vision in Unity, replicating it as closely as possible using the shaders they had created specifically for the project, basing them on the properties of our visual target sequence.

 

The end result is something we’re all proud of at RealtimeUK, and it has been regarded as a huge success by ARM. Both RealtimeUK and ARM have learnt from the creation of this demo, and it is something that ARM’s ecosystem will benefit from, with the results feeding into the 'ARM Guide to Unity: Enhancing Your Mobile Games', which is available to view on the Mali Developer Center. As ever, our team has constantly pushed what is achievable in terms of graphical fidelity, regardless of the final platform. Whilst no one will deny that there still remains a clear discrepancy between what can be achieved in-engine and from a pre-rendered pipeline, it’s nice to know that collaborative efforts between studios like RealtimeUK and technology pioneers like ARM are closing the gap.


Heterogeneous System Architecture 1.0


HSA_logo.png

The Heterogeneous System Architecture 1.0 specifications have now been released.

 

Jem Davies has talked about earlier releases of the HSA specification which gave an overview of the Programmer's model, and it's great to see the System Architecture, Programmer's Reference Manual and Runtime specifications finalised and available for download. The HSA System Architecture specifies hardware which fits into modern SoCs and a clean Programming Model for software and compiler design which fits well into modern operating systems. It's well suited to CPUs, GPUs, DSPs and any other devices which support offloading of computation tasks, programmable or otherwise.

For developers, HSA provides a simple and consistent interface to the hardware and then gets out of the way. Moreover, HSA exposes hardware in a standardised way that has support from a large number of companies in both the Desktop and Mobile space. One of the key things this allows for is opening up acceleration API design to more developers; HSA is meant to be built on.

What do you do with it?

You build APIs on top of it.

There's a lot of active debate about what the right API for accelerating general-purpose programs on heterogeneous systems is; it's still a young area, as evidenced by the large number of APIs available today. We're still trying to find the right API (or APIs) to make it easier to speed up programs and make them more power efficient. HSA makes it easier for any person or company to prototype or develop better tools and APIs for heterogeneous systems.

For domain specific issues it also means that the API design on top of HSA can be made to suit the problem at hand, and those who best understand their problem can design the solution.

What doesn't it do?

HSA is a low level interface with a focus on direct hardware support, so it's not typically something you would program to directly if you're writing applications. It's also got a relatively high bar to entry, so if writing your own compiler front-end isn't something you had in mind, you should probably look at APIs built on top of HSA.

The good thing is, there are already API implementations written on top of HSA to get you started. Take a look at the open source OpenCL C compiler and C++ AMP implementations that are already available.

What's next?

HSA simplifies the programming model by providing mechanisms consistent with those seen for CPU development.

  • Sharing of page tables means system allocations are available on HSA agents simply by using standard OS allocation mechanisms; the process address space is simply "extended" to all capable HSA agents.
  • Full coherency support provides an intuitive parallel programming model, and better performance than software methods when sharing resources.
  • User mode queues with a defined hardware packet format simplify generating work for offload. Whether for asynchronous batching of commands or immediate blocking work, the low latency of user mode queues is well suited.
  • HSAIL provides a standard compiler target for frontends. Code can be compiled to it at the same time as host CPU code, simplifying toolchain integration when using accelerators.
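To give a flavour of the "defined hardware packet format" mentioned above, the kernel dispatch packet described in the HSA specifications occupies exactly 64 bytes. The struct below is a sketch following the field layout of the public `hsa.h` header; treat it as illustrative rather than a drop-in definition:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the HSA Architected Queuing Language (AQL) kernel
 * dispatch packet. Field names follow the public hsa.h header, but
 * this struct is for illustration only. The whole packet is 64 bytes
 * so it can be written into a user mode queue with simple stores. */
typedef struct {
    uint16_t header;               /* packet type, barrier/fence bits */
    uint16_t setup;                /* number of grid dimensions */
    uint16_t workgroup_size_x;
    uint16_t workgroup_size_y;
    uint16_t workgroup_size_z;
    uint16_t reserved0;
    uint32_t grid_size_x;
    uint32_t grid_size_y;
    uint32_t grid_size_z;
    uint32_t private_segment_size; /* per work-item scratch, bytes */
    uint32_t group_segment_size;   /* per work-group local, bytes */
    uint64_t kernel_object;        /* address of the code descriptor */
    uint64_t kernarg_address;      /* kernel argument buffer */
    uint64_t reserved2;
    uint64_t completion_signal;    /* signalled when work completes */
} dispatch_packet_sketch;
```

The application fills in a packet like this directly in queue memory and rings a doorbell signal; no driver call is needed on the dispatch path, which is where the low latency comes from.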

HSA_arch_simple.png

The hope is that HSA can become the basis for innovation for heterogeneous compute APIs. As implementations continue to advance and overheads to use of accelerators continue to reduce, it will become easier to offload small pieces of work to accelerators.

Benefits for mobile development

HSA allows for communicating agents at the hardware level, without CPU involvement. This means fewer cycles where your CPU needs to be active, and the CPU isn't unnecessarily used as a communication path between two independent units. It also means (with the help of support libraries and OS software) that common formats can be used on multiple devices in the system and no copies are required for reformatting data shared between devices. With the flat address space of the HSAIL model, programmers can more easily retarget normal CPU algorithms to other devices and not have to specialise algorithms as heavily for features like local scratch memory or segmented addressing.

Mobile SoCs are also more constrained by the power and thermal budget available in the small form factors of phones and tablets. As a result, mobile development is more sensitive to overheads from copying or driver validation; HSA avoids this by adding hardware support for coherency that can be easily extended to the non-CPU devices in the system.

Full coherency and a shared address space mean algorithms can be working on different areas of an input buffer without having to carefully partition work statically to cache line boundaries. This allows programs to run across more devices in the system, often at a lower frequency and voltage, resulting in lower total energy for equivalent computations. This can be used to either speed up the computation, or to conserve battery.

A promising future

Having HSA as a low level hardware interface means it is naturally applicable to more problems and opens up the development of higher level APIs which can now accelerate general purpose compute on GPUs, DSPs and other devices. The task based queue interface implemented in user process mapped memory allows for low overhead task dispatch and minimal CPU involvement.

HSA hardware complements existing standards such as OpenCL, SYCL, C++ AMP and OpenMP 4, with features useful for many of these APIs. HSA isn't a revolution but an evolution and standardisation of key hardware features that open up development to a much larger community.

PLAYHACK with ARM, winner announced!


Hi everyone,

 

Just wanted to write a quick blog with the news that the PLAYHACK with ARM competition that launched on the ARM booth at GDC 2015 has finished! The competition ran throughout March, and the challenge was to create the best WebGL game using the PlayCanvas engine and the ARM buggy asset, as seen below. The prize was a Samsung Chromebook 2 13.3" (full specs at the end of this post).

 

The winning game is Space Buggy by lmao; head over to the PlayCanvas blog to see the full announcement and honourable mentions for the runners-up.

 

Space Buggy animation (buggy_600.jpg)

 

PLAY SPACE BUGGY HERE


You can also check out the SeeMore WebGL demo, which runs on the Mali Developer Center; see the screenshot below:


SeeMore WebGL demo by ARM and PlayCanvas










Samsung Chromebook 2 13.3”

Chromebook2133.jpg

  • Samsung Exynos 5 Octa (5800)
  • ARM® Mali™-T628 MP6 GPU
  • ARM Cortex®-A15 MP4 and Cortex-A7 MP4 CPUs
  • ARM big.LITTLE™ processing

What's new in Mali Graphics Debugger 2.1 and OpenGL ES Emulator 2.1


gdc15_logo.png

Every year at GDC, we like to present some important updates regarding the development tools for game developers that target devices with ARM® Mali™ GPUs. In 2013, we previewed Mali Graphics Debugger v1.0, which was then released a few weeks later. Exactly one year later, at GDC 2014, we showcased v1.3, which included the brand new frame replay feature (see User Guide Section 5.2.10 for details), a new binary format for traces and many performance improvements. In the meantime, we had already implemented advanced features like Frame Capture, Shader Map, Overdraw Map, support for ASTC textures and shader statistics. Version 1.3 has been the most widely used version of the tool, supporting the Khronos APIs OpenGL® ES 1.1, 2.0 and 3.0, as well as EGL and OpenCL™.

 

citadel-frame-analysis3-scaled.gif

 

Mali Graphics Debugger has been extremely useful to a wide range of developers, from our internal GPU driver teams, to our silicon partners and OEMs, to game engines and games developers, and this is why GDC is such an important event for us.

This year at GDC 2015, we released version 2.1, based on the brand new version 2.0, released right at the end of last year. In the latest version we have made some major improvements to the tool including:

 

OpenGL ES 3.1 and Android Extension Pack support

Now Mali Graphics Debugger can trace all the functions that are supported in the Mali GPU drivers, and even more, to allow early support for some that are still being developed. This means that all OpenGL ES 3.1 function calls will be present in the trace, and most of the OpenGL ES extensions can be captured seamlessly.

OpenGL ES 3.1 adds support for features like compute shaders, which are a flexible way to manipulate general purpose buffers using the GPU, so that workload can be moved from the application processor to the graphics processor. Other features of OpenGL ES 3.1 are indirect draw calls, which allow the GPU to manage draw parameters rather than the CPU, and enhanced texture features like multisample textures. The extensions included in the Android Extension Pack support geometry and tessellation shaders, in addition to the ASTC texture compression format and many other features. OpenGL ES 3.1 and a selection of features of the Android Extension Pack are now supported in Mali Graphics Debugger and in our Mali OpenGL ES Emulator.
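To give a flavour of the feature, here is a minimal ES 3.1 compute shader that scales a buffer in place on the GPU; the binding point, buffer layout and scale factor are illustrative:

```glsl
#version 310 es
layout(local_size_x = 64) in;

// Illustrative shader storage buffer: a flat array of floats that the
// GPU updates in place, with no vertex data or framebuffer involved.
layout(std430, binding = 0) buffer Data {
    float values[];
};

void main() {
    uint i = gl_GlobalInvocationID.x;
    values[i] *= 2.0;
}
```

The host side would bind a shader storage buffer to binding point 0 and launch the shader with glDispatchCompute, sized so that the work groups cover the buffer.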


TessellationResult.png

 

Support for Android 64-bit

(Or technically, ARMv8-A AArch64 devices)

 

Android 5.0 introduces platform support for 64-bit architectures, including ARMv8-A devices. We have ported the Mali Graphics Debugger target components to 64-bit architectures and extensively tested them on our Juno ARM Development Platform (getting started), which is equipped with ARM Cortex®-A57 and Cortex-A53 MPCore™ CPUs for ARMv8-A big.LITTLE™ processing and a Mali™-T624 GPU for 3D graphics acceleration and compute. This was particularly useful when porting Epic Games’ Moon Temple demo to 64-bit. Now it is available to everyone, and we are looking forward to trying it on the brand new Samsung Galaxy S6 phones.

 

Live editing is becoming even more powerful

Mali Graphics Debugger allows users to edit shaders, override textures and precision while capturing an application. This is done by replaying the same frame, with modified assets, over and over on the target device.

With version 2.1 you can now:

  • Change both the fragment and vertex shader of a program and replay the frame to view the results.
  • Override textures in an application and replace them with a new texture that will aid in diagnosing any issues with incorrect texture coordinates.
  • Override the precision of all elements in a shader and replay the frame to view the results (force highp/mediump/lowp modes).

 

New Android application provided to support unrooted devices

With the objective of making the installation of the graphics debugger on Android targets easier, we have developed an Android application that runs the required daemon. This eliminates the need to manually install executables on the Android device. The application (APK) works on rooted and unrooted devices.

mgdapk.png

 

New features for GPU compute

Mali GPUs don't just render graphics; they also support general purpose computing, which can be done with compute shaders in OpenGL ES or with OpenCL, depending on the use case. In this version, we have a new view for compute shaders, displaying the same shader statistics as for vertex and fragment shaders, which can be very useful for optimizing them and finding bottlenecks.

 

For OpenCL developers we have also added support for GPUVerify, a tool for formal analysis of GPU kernels written in OpenCL.

GPUVerify was originally designed by Alastair Donaldson (Imperial College London), and has been supported by ARM, among other partners. Read the detailed paper here.

 

Availability and support

As always, tools provided by ARM are supported in the ARM Connected Community. You can ask a question in the Mali Developer Forums, follow us on Twitter or Sina Weibo, or watch our YouTube and YouKu channels.

Live editing OpenGL ES shaders with Mali Graphics Debugger


With Mali Graphics Debugger you can edit OpenGL® ES shaders on the fly on your Android or Linux device while the game is still running. In fact, the tool will replay a frame over and over with modified shaders, so you can check the output on the display, or capture the frame for further inspection. This feature is particularly useful if the output does not look quite like what you expected, if you need to experiment with different color and alpha values for blending, or when developing post-processing effects.

 

Dynamic editing

This is different from static shader editing (or material editing), because with Mali Graphics Debugger you are not working on a single shader in isolation. Instead you are editing it in the context of the actual frame it will be used on, with all the actual assets, textures, post-processing effects and camera position.

 

Live shader editing demo

Here's a demonstration of live shader editing. In this video the Epic Citadel demo is captured with Mali Graphics Debugger and one of its shaders is being modified. Finally, a frame is replayed with the modified shader, to show its effect.

 

 

0:08 Capturing Epic Citadel

0:17 Enabling shader map mode, to see what shader is used to draw the sky

0:30 Shader 3, inside Program 1, is the one we are going to edit

0:41 We are multiplying the RGB values of the final color by (1, 0, 0), which means that we keep only the RED channel

0:50 Replay the frame with the modified shader

0:57 Capture the modified frame
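The edit made at 0:41 amounts to a one-line change in the fragment shader. A hypothetical minimal shader showing the same red-channel trick (the uniform and varying names are illustrative, not the ones in the actual Epic Citadel source):

```glsl
precision mediump float;

uniform sampler2D u_texture;  // illustrative names, not Citadel's own
varying vec2 v_texcoord;

void main() {
    vec4 c = texture2D(u_texture, v_texcoord);
    // The live edit: multiply RGB by (1, 0, 0) to keep only red.
    gl_FragColor = vec4(c.rgb * vec3(1.0, 0.0, 0.0), c.a);
}
```

Replaying the frame with this shader immediately shows every surface drawn by that program in shades of red, which makes it easy to see where the shader is used.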

 

vlcsnap-2015-04-10-11h09m39s168.pngvlcsnap-2015-04-10-11h09m24s109.pngvlcsnap-2015-04-10-11h09m48s28.pngvlcsnap-2015-04-10-11h09m58s107.png

 

Additional information

For downloads and more information, see Mali Graphics Debugger - Mali Developer Center.

You can find other videos about Mali Graphics Debugger in Tutorials: ARM Mali - YouTube and ARM - YouTube

 

Have you tried this yet? What do you think of it, and what would you like to see in the next version of Mali Graphics Debugger?

ARM Mali Graphics Week - April 13 -17, 2015


Gemma Paris

Promoting_Mali_Developer_Centre_940x380px.jpg

ARM is hosting Graphics Week in the ARM® Connected Community. This is a roundup of the tools and resources to help developers get the most out of the latest hardware, along with proven tools to debug and optimize their apps, and techniques for producing high-quality visuals on mobile platforms.

 

During Graphics Week, we will share blogs and videos of tools that will help game developers simplify the development process
and deliver console quality content to mobile platforms. Topics we will cover include:

  • Compute Shaders
  • Shadows Based on Local Cubemaps
  • Ice Cave Demo with Unity 5 and Enlighten™
  • Updates to the Latest Tools such as ARM Mali™ Graphics Debugger
  • Lighting Mathematics by Geomerics, an ARM company
  • General Lighting and Games by Geomerics, an ARM company

 

URL: cc.arm.com/graphicsweek

 

Related Blogs:

 

Related Videos:
