
[VIDEO] Goo Technologies make HTML5 and WebGL look easy


The arrival of HTML5 and WebGL is one of the biggest leaps forward in the history of the browser.

 

It's a big claim. In this video from the ARM® TechCon Developer Summit, Marcus Krüger of Goo Technologies explains why he thinks it's true.

 

 

If you're interested in finding out more about what the ARM Mali team is doing in the field of HTML5, Matt Spencer is the best person to ask - check out his latest blog on the subject HTML5, The Art of Illusion.

 

What are your thoughts on the subject? Have you tried developing with HTML5 yet? Share your experiences below.


ARM’s GPU roadmap; the WHY


With the recent launch of two new products at ARM® TechCon™ 2013, one High-end product (the ARM Mali™-T760 GPU) and one Mid-range product (the ARM Mali-T720 GPU), ARM further extended its lead as the preferred GPU IP supplier for the mobile industry. Over the past month we have received a significant level of positive feedback from the industry concerning these two new products and our decision to further enhance our GPU roadmaps.

ARM separated its GPU roadmap into two parts for a reason. One size does not fit all. One product cannot possibly be best at everything. With our ARM Mali GPUs, our partners are creating devices targeted at a variety of end-users in a tremendous diversity of markets. Yet DTVs, set-top boxes, watches, mobile phones, tablets and so on all have slightly different requirements. While ARM Mali GPUs are all highly scalable - making them easy to adapt to the market of choice - accessing the different price points within all of those markets requires some wider thought.

 

The general rule of thumb is that, for the High-end roadmap, performance is the key. This goes for the GPU as well as the CPU. And it is all about performance within a constrained power and thermal budget. Even though DTVs are plugged into the wall socket, they still have to dissipate the heat from the SoC – and as we all know, DTVs are getting amazingly thin. The same story goes for mobile phones, watches and tablets. In the High-end it is all about performance, and consumers want performance without limitations. Following the latest standards and features is of the essence here – the highest performance points require hardware that supports the latest APIs. GPU Compute is one example: the performance, as felt by the user, improves when the GPU supports compute. The same can be said for OpenGL® ES 3.0. It is these added features that not only make the visuals feel better but also push the performance envelope within the limited power and thermal budget. The ARM Mali-T760 GPU pushes the performance envelope further than any previous ARM Mali GPU – and at the same time adds support for new features never before seen in mobile.

 

As for the Mid-range roadmap, the markets this roadmap addresses are all about cost. The cheaper the product, the more people can afford to buy one - or upgrade to one. The consumer wants to buy a product that can handle the latest content with a reasonable level of performance. The product shouldn't just be able to barely open the latest game; it should be able to give a good user experience with a certain quality of look and feel. So what does 'cost' equate to in this part of the roadmap, as seen from an IP supplier's view? Or to phrase the question a bit differently: how can ARM help to keep the device cost down? Seen from the end-user's perspective, the cost of the device is the sum of its hardware components, software components, marketing costs, distribution costs and profit margin. The profit margin of the OEMs is not in ARM's direct control. Neither are the distribution costs. But the other three - hardware costs, software costs and marketing costs - are all things ARM can help to minimise. How? Let me go through each of them in detail.

 

Hardware

When someone is assembling, for example, a mobile phone, the major cost drivers are the System on Chip (SoC) and the battery. The SoC cost is proportional to the die area of the Integrated Circuit (IC). Having a smaller GPU (in terms of mm2) helps to reduce cost, so ARM works to ensure that each GPU is as area efficient as possible. The bandwidth required for the GPU to render given content is another cost driver; could one reduce the number of external memory pins by having a more bandwidth efficient GPU? If so, the cost could be reduced further without impacting the performance. That's why ARM is working hard to increase the bandwidth efficiency of each GPU. And then there is the power budget. It is not surprising that a smaller battery costs less, but, in addition, having the SoC emit less thermal energy will also reduce the requirements for the chassis of the device, making the choice of materials and design much wider. And with a wider choice comes further cost savings.

 

Software

Hardware is nothing without software. All hardware from ARM is designed with software in mind. This is reflected in our GPU software strategy. Using the same basic architecture (the Midgard architecture) across multiple products lowers the software development cost, and with it the price end users have to pay. But the GPU by itself can only do so much without the rest of the system. The GPU needs something to hand work to it, and this is where the CPU comes in. The various ARM Cortex® CPUs are great pieces of hardware.  In order for a given task to run as efficiently as possible, the CPU has to interact with the GPU as smoothly as possible. Some of this interaction happens purely through hardware (as is the case with I/O Coherency), whilst most of the interaction happens in a software layer. Building drivers that run on the CPU for these complex GPUs is not an easy task, and in order to make them as effective as possible, knowledge of not only the CPU but also the GPU is of the essence. ARM leverages its system-wide expertise to keep software costs down by sharing its knowledge across the different system components, such as the CPU and the GPU.

 

Marketing cost – benefits of the ARM Ecosystem

Ever heard of the ARM Ecosystem? The ARM Ecosystem is all about our partners coming together to leverage their strengths. One could argue that the marketing cost of a phone is not within ARM's influence. But I would say that ARM helps reduce the cost of a device by enabling partners to leverage close connections with other partners when launching products. The ARM partnership is not only for silicon vendors; it is for the various OEMs, software vendors, game publishing companies, foundries and so on. If someone is going to launch a ground-breaking product, why wouldn't they leverage the ARM partnership? The marketing message can be much stronger and costs can be minimized at the same time. And of course ARM wants its partners to be successful when launching new products.

 

Looking at the Mid-range roadmap with its newly launched ARM Mali-T720 GPU and the more mature ARM Mali-450 GPU, you can clearly see all of the above factors play key roles in these products. Both are created to suit their respective markets with a clear emphasis on cost. Whilst the ARM Mali-450 GPU supports OpenGL ES 1.1 and 2.0, the ARM Mali-T720 GPU supports OpenGL ES 1.1, 2.0, 3.0 and the Compute APIs Renderscript/Filterscript – important when looking forward in the mobile space.

 

Another thing that is clear is that the two roadmap sections, one High-end and one Mid-range, are not mutually exclusive. While the high-end has a clear focus on performance and the mid-range focuses on cost, this does not mean products from each of the two parts do not inherit properties from each other.

 

If you have any questions regarding our GPU products and our GPU roadmap please post them below!

Ittiam and ARM are the first to efficiently bring Google’s VP9 to mobile devices


This blog post was written in conjunction with Ittiam Systems, a technology company singularly focused on embedded media centric systems. For more information concerning the technology below, please contact Mukund Srinivasan, Ittiam Systems' General Manager of Consumer and Mobility Business, at ittiamsystemsadmin, or Roberto Mijat, ARM's Visual Computing Marketing Manager. Ittiam Systems' solutions will be showcased at several partner booths at CES this week, including ARM's and Samsung System LSI's.

 

UPDATE: Since publishing, Google has complimented Ittiam's VP9 Decoder: "It's exciting to see the progress that ARM and Ittiam have made optimizing VP9 to maximize the compute capabilities of ARM's Mali GPU platforms. By leveraging the GPU on mobile devices, connected TVs and other embedded platforms, manufacturers will be able to quickly offer high-resolution, power-efficient VP9 video with software-based solutions" - Matt Frost, Senior Business Product Manager, WebM Project, Google.

 

YouTube is the preeminent global-scale video creation and consumption platform and is home to mind-blowing statistics: 1B+ monthly users, 4B+ video views per day, 6B+ hours watched each month, 72+ hours of video uploaded each minute. High Definition uploads have now started to overtake the number of Standard Definition uploads. Storage cost savings and lower bandwidth consumption for the same or better quality of video delivery, online and in a connected world, have never been so important. With H.265 and VP9 staking claims to significant bandwidth savings over H.264 and VP8 respectively, the call for incorporating these new-age standards has grown stronger and louder from the entire market, from silicon vendors catering to the set top box market all the way to application processors embedded within smartphones. With all the advantages touted, it is only a matter of time before we see the Online Video Services Ecosystem widely adopt these cutting-edge standards as the codecs of choice for user-generated content consumption via platforms like YouTube. The other big opportunity that a highly compression-efficient standard harbours is the advantage of using VP9 as the medium of storage and delivery in the cloud, with the decode happening at the last mile on the client device, thereby lending significant cost savings to operators and online video service providers alike.

 

These paradigm shifts open a unique window of opportunity for focused, media-related Intellectual Property providers like Ittiam Systems®. The main hurdle to cross comes in the form of how to manage the enhanced compute demands of a standard like VP9 on mobile devices, since it bears significantly more complex coding tools than VP8 or H.264. However, collaborating with a longstanding partner and a technology pioneer like ARM® enabled us to generate original solutions to the complex problems posed in the design and implementation of consumer electronic systems. These problems include how to balance features and performance without severely denting the energy-performance curve (battery life) of the mobile device, and how to lower the entry barrier by bringing these newer codec standards to the swiftly growing lower end of the mobile market. Thanks to some outstanding support from various quarters across ARM - the marketing team, technical liaison and, last but definitely not least, the tremendous leadership backing - Ittiam Systems have been able to deliver a truly innovative solution which lowers the entry barrier for VP9-capable performance on smartphone devices and futuristic platforms.

 

The Ittiam VP9 Decoder, built in collaboration with ARM and Google, focuses on power, scale and portability with equal importance given to each. It runs at 1080p 30fps leveraging the ARM Mali™-T604 GPU on an Arndale board powered by Samsung's Exynos 5 Dual SoC. In order to reduce the compute cost of VP9 on mobile devices, the Decoder offloads compute-intensive tasks to the ARM Mali-T604 GPU using standard GPU Compute APIs, with significant leverage of the GPU to achieve an overall lower power profile. With intelligent partitioning of the algorithm and by identifying unique, compute-intensive functions that are well suited for GPU processing, we are able to significantly lower the load imposed on the CPU by the VP9 codec, leading to advantages on multiple fronts. These advantages include freeing the CPU for other system or user tasks, which results in better responsiveness and performance; lowering CPU clock demands; and, most importantly, offering significantly longer battery life thanks to the energy efficiencies gained by this GPU Compute solution. The implementation helps tremendously in achieving class-leading levels of performance, with wide support across a variety of ARM Mali-T6xx GPU generations, enabling higher performance with use of the on-board GPU.
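The blog does not include source code, but for readers new to GPU Compute, the host-side pattern for handing a single compute-intensive stage to the GPU through OpenCL looks roughly like the sketch below. The kernel body, buffer sizes and one-dimensional work split are illustrative assumptions made for this post; they are not Ittiam's actual implementation, which partitions the real VP9 algorithm far more carefully.

#include <CL/cl.h>
#include <stddef.h>

/* Illustrative only: run one data-parallel stage on the GPU while the CPU
 * continues with the serial parts of the decode. Error checking omitted. */
int run_stage_on_gpu(const float *in, float *out, size_t n)
{
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Hypothetical kernel: one work-item per sample, standing in for a real
     * filtering stage of the decoder. */
    const char *src =
        "__kernel void stage(__global const float *in, __global float *out) {"
        "    size_t i = get_global_id(0);"
        "    out[i] = in[i] * 0.5f;"
        "}";
    cl_program program = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "stage", NULL);

    cl_mem buf_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                    n * sizeof(float), (void *)in, NULL);
    cl_mem buf_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                    n * sizeof(float), NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf_in);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &buf_out);

    /* Queue the work on the GPU, then read the result back when it is done. */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, buf_out, CL_TRUE, 0, n * sizeof(float), out,
                        0, NULL, NULL);

    clReleaseMemObject(buf_in);
    clReleaseMemObject(buf_out);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return 0;
}

In a real decoder the context, buffers and kernels would be created once and reused every frame, and the output would feed the next pipeline stage rather than being read straight back to the host.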

 

 

The Ittiam VP9 Decoder in action

 

Finally, this solution helps deliver a strategic advantage to customers at multiple levels: be it reducing time-to-market, receiving an integrated solution from Ittiam, leveraging the wide suite of complementary multimedia IP on the ARM platform, or getting unparalleled support for integration and future interoperability.

ARM MPD at Mobile Game Forum 2014


Hello and Happy New Year!

 

So, here we are at the beginning of another new year preparing for Mobile Game Forum (MGF) in the centre of London.  This event is well attended by developers, game engine providers, middleware experts, payment and analytics companies, to name but a few.  ARM will be there to show cutting-edge mobile hardware running the latest and greatest game-related software created both internally and by our Ecosystem partners.  Some of the demos we will be showing are:

 

Samsung Galaxy Note 3 running an internal demo developed to show the benefits of the new texture compression format ASTC.  ASTC is the latest texture compression method adopted by the Khronos Group as part of the new OpenGL® ES 3.0 standard.  Using ASTC, developers and artists are able to specify many parameters to ensure their content fits within the specified memory footprint while maintaining artefact-free quality.  Other positive side effects of better texture compression are lower bandwidth consumption as textures are loaded from memory to the GPU, yielding significant power savings and improving latency and load times.  These power savings, quality improvements and load times are all viewable in our live demos at MGF.  It's easy to see why ARM designed and donated such an efficient texture compression algorithm to Khronos – we're obsessed with power-saving processing technology.
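If you want to experiment with ASTC yourself, uploading a pre-compressed texture takes just one call to glCompressedTexImage2D. The sketch below assumes the device exposes the GL_KHR_texture_compression_astc_ldr extension (as ARM Mali-T6xx based devices such as the Note 3 do) and that astc_data and astc_size hold an image already compressed offline with a 6x6 block size; the function name and parameters are illustrative.

#include <GLES3/gl3.h>

#ifndef GL_COMPRESSED_RGBA_ASTC_6x6_KHR
#define GL_COMPRESSED_RGBA_ASTC_6x6_KHR 0x93B4
#endif

/* Upload an already-compressed ASTC image; the driver keeps it compressed in
 * memory, so less bandwidth is used every time the GPU samples the texture. */
GLuint upload_astc_texture(const void *astc_data, GLsizei astc_size,
                           GLsizei width, GLsizei height)
{
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glCompressedTexImage2D(GL_TEXTURE_2D, 0,
                           GL_COMPRESSED_RGBA_ASTC_6x6_KHR,
                           width, height, 0, astc_size, astc_data);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    return tex;
}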

 

We will also be showing the new Samsung Galaxy Note 10.1 2014 edition, which also contains the Samsung Exynos 5420 Octa processor.  The Exynos 5420 contains eight ARM® Cortex® CPU cores and a six-core ARM Mali™-T628 GPU to drive the 2xHD resolution display (HD TVs are ~2MPix while the Note 10.1 drives ~4MPix).  To show off the outstanding visual capabilities of this device we have been given a sneak preview of the new Vendetta Online developed by Guild Software.  This new build makes use of very high quality 2048 x 2048 textures with advanced shader techniques to give a truly immersive gaming experience.  This game originally started life as a PC title and has now been ported to mobile, highlighting the narrowing gap between Mobile/Console/PC.

 

Amlogic have kindly equipped us with their new set-top box prototype built around their new AML8726-M8 SoC, which contains a quad-core ARM Cortex-A9 CPU running at 2GHz and a six-core implementation of the ARM Mali-450 GPU clocked at an amazing 600MHz!  I wonder what they have in mind for such a high performance chip?  How about running 4K content at vSync frame rates!  As shown at CES at the beginning of January, we will be showing UHD 4K gaming and UI content developed by our Ecosystem partner Autodesk Scaleform.  I chose the fantastic Samsung UE55F9000 55” 4K TV to present this content.  Did you know the UE55F9000 also contains a dual-core Cortex-A15 CPU and quad-core Mali-T604 GPU, which I can only presume is used to render the high quality DTV user interface?  Could high quality games be coming to this platform too?

 

Also shown on our demo tabletop is the Samsung Chromebook running the Chrome web browser, which was heavily optimised for Mali GPU-based devices by our driver team, who worked closely with Google.  To show off its performance I have chosen to run the new Flight Simulator demo kindly provided by our Ecosystem partner Goo Technologies.  Try running this (http://photonfrog.com/tunnan/next/client.html) on your browser now and let me know if your computing device gets noisy and hot as the fan spins up.  Note that the Chromebook doesn't have a power-inefficient cooling fan and its battery life is measured in days rather than hours.

 

We couldn't attend a games event without showing the latest demo from Unity, so I have chosen to show the Chase demo running on a standard Samsung Galaxy Note 3 as it supports OpenGL ES 3.0 right out of the box at native 1080p Full HD resolution.  This demo shows off more advanced graphics features than I have space for in this blog, but be prepared to be wowed by the sub-surface scattering techniques heavily used in this tech demo.  For more details of this particular demo have a look at its premiere showing at SIGGRAPH 2013, where it was running on an upgraded Nexus 10.

 

 

 

We will also be showing the Samsung Galaxy Gear wrist watch running some of the demos we showed last year on the world's most powerful smartphones.  This watch implements very similar silicon to the Samsung Galaxy S handsets, with its dual-core Cortex processor and quad-core Mali-400 GPU, and, as expected, it handles high-end, console-quality game content with ease.

 

Mobile Game Forum will also be the premier event for the public showing of the latest technological advances from Geomerics with their Enlighten Global Illumination middleware.  This demo shows many dynamic light sources illuminating a scene in real time without any ambient lighting, while also saving over 1Gb/second of memory bandwidth at HD resolution – yet another example of how ARM is driving down energy requirements while improving performance and user experience!

 

See you at the Mobile Game Forum show on 22nd and 23rd January.

 

Phill

ARM Mali GPU technology at CES - what's new in 2014


ARM shone at CES this year and there have already been some fantastic pieces of coverage on our community about it - check out Latest ARM powered tech at CES 2014 with Samsung Galaxy Note Pro 12.2 and Galaxy Tab Pro 12.2, the Snakebyte Vyper gaming experience, and Jigabot AIMe auto-tracking camera mount, All the best bits of CES2014 with ZTE Nubia 5S, Pine Smartwatch, Parrot Sumo, Audi S3, and NVIDIA Tegra K1 or Final day of CES2014 with Sphero, ARM CEO Simon Segars, Makerbot Replicator 2 3D Printer, and Anki Drive smart toy car racing for more details. However, if you were looking for some more graphics highlights from CES, look no further than here!

 

 

Firstly, at the start of January, hype had started to build around the VP9 and HEVC demos that would be on display across a variety of booths on the show floor.  In addition to their VP9 decoder (see Ittiam and ARM are the first to efficiently bring Google’s VP9 to mobile devices), Ittiam Systems have done a fantastic job in developing an HEVC decoder which, as shown in the demo above, is optimized for ARM and runs off the Samsung Exynos Dual SoC at 1080p 30fps whilst only using a minimal fraction of the processing resource available. Extrapolations from this demo suggest that the Samsung Exynos Dual Core processor is more than capable of running 4K content at an efficient level using this decoder - watch this space for more details!

 

Adaptive Scalable Texture Compression (ASTC) has been talked about a lot in these blogs - give it a quick search if you need to find out more - but it is only in the past couple of months that it has become available on shipping devices (the Samsung Galaxy Note 3 with an ARM® Mali-T628 GPU was the first out). This next demo, an adaptation of the well-received SeeMore demo, is the first time that ASTC has been shown in public. To start with, the demo compares the energy consumption of uncompressed, ETC2 and ASTC textures. Let us know in the comments below if you can see a decline in image quality from ETC2 compression when the more efficient ASTC is applied - we bet you can't!

 

 

As you can see in the second half of the video, the "Difference Map" highlights the visual artifacts which the texture compression produces. What do you think? Will you consider using ASTC in your next project?

 

Do you need a processor with GPU Compute to render 4K graphics? As this next demo shows, no. The six-core ARM Mali-450 GPU (600MHz) and quad-core ARM Cortex®-A9 CPU (2GHz) in the Amlogic AML8726-M8 SoC have enough processing power to deliver high quality, smooth 4K gaming and user interfaces (courtesy in this demo of Autodesk Scaleform).

 

 

But GPU Compute is extremely useful when it comes to next generation computer vision and gesture recognition, as this next demo shows. eyeSight Technologies have perfected their gesture recognition algorithms. Their solution is fast, smooth, responsive and intuitive even at lower light levels. It works by using OpenCL to harness the extra computational capacity of the GPU so that complicated, parallel workloads can be performed more efficiently and at higher frame rates. This demo not only shows how the solution works, but also compares its performance when implemented on the CPU or the GPU. Although it is being shown here as a DTV/home entertainment use case, it is more than suitable for other situations, for example in the automotive industry.

 

 

Got any questions about the demos above? Feel free to ask us a question in the comments below - we'd be happy to help out!

The Future of Mobile Gaming discussed at #MGF2014


I've just come back from spending two extremely informative days at Mobile Games Forum in London. An event that focuses on the business side of game development, it provides developers with the information they need to not only bring their game to market efficiently, but to do it so successfully that it becomes the viral app of 2014. With a keen focus on marketing information, industry experts set out what they consider to be the main opportunities and threats to the industry, explaining where the trends lie and what they see to be the next game-changing event on the horizon.

 

Now, I will admit here and now (for better or worse) that I am not a developer nor have I ever been - I am a marketing professional through and through (there, I've said it!) and to me this event was fascinating. The mobile gaming market is one of huge potential and massive opportunities. In just ten years it has undergone serious evolution and it is not looking to slow down any time soon. For people producing applications within the market, where development lead times can be anywhere from a year to ten years, it's extremely volatile and who knows whether the plans you have for a fantastic game now will be relevant in three years' time? And there are other issues too. Different global markets have different attitudes to mobile games, so what do you do if you're looking to globalize? Market saturation is so high, how do you make your app stand out above the crowd? In a society that clearly prefers free-to-play, where are the opportunities for profit? These are questions we rarely ask in these blogs, yet they are problems that members of our ecosystem face every day.

 

Thankfully, informative market data was coming in from all angles. The US and Japan are unsurprisingly two of the largest markets for mobile games in 2013 (both in terms of downloads and revenue) - yet Brazil, Russia, Mexico and Turkey are all making great strides within the list of Top 10 mobile gaming countries and are well worth considering for developers trying to find a niche market where their games might stand out above a thinner crowd. Action games are more popular in download terms than Arcade games, which are more popular in turn than Puzzle games, but Role Playing games monetize a lot better. And if you really want to make money from your app, bear in mind that adverts only play a 12.4% part in the total purchasing decision of a mobile game customer - if you really want to do well, you'd turn your attention to creating positive social media and word of mouth, which combined influence 35% of the purchasing decision. The themes of user acquisition, user retention and the lifetime value of a user were prevalent in Day One of the conference. As an FYI for developers, if you have 50% user retention after Day 1 you are doing stunningly - average market figures suggest that the first day retention rate is more likely to be at about 25 - 30%. Also much discussed was the question of the merits of the free-to-play versus premium versus "freemium" business model - and perhaps the most sensible and simplest suggestion on this matter came from Dan Gray of ustwo: pick a business model, design the game to suit the chosen business model perfectly and then stick with it. Different business models have such different design needs that changing models post-launch is asking for trouble.

 

Day Two of the conference focused more on technological opportunities and this is where ARM stepped in to give both a presentation and a panel discussion. However, I would like to dwell for a paragraph on the very good opening keynote by Robert Tercek, the Chairman of Creative Vision Foundations. He contemplated the changes that he has seen over the past ten years not only to the mobile device market, but also to society thanks to the mobile device market - 69% of 18-30 year olds publish videos online; 350m photos are uploaded to Facebook each day; 100hrs of video are uploaded to YouTube every minute; that picture of the announcement of Pope Francis. Mobile devices have become inseparable from our daily lives. In so many ways they act as a sixth sense, helping us to take in and remember our environment and activities. And Robert encouraged the game developers to take this idea even further. Up until recently there have always been distinctly separate "digital experiences" and "real-world experiences". Why can't game developers help the two to combine? Overlay the digital world over the real one to accentuate the experience? And as the Internet of Things heats up, mobile game developers need to start taking advantage of the fact that the smartphone will become the central interface between the user and the world around them - there is a wealth of opportunity out there for the innovative.

 

Back to the ARM talks.  Following a whistle-stop tour through the latest ARM technologies that are becoming available to mobile game developers (64-bit, big.LITTLE, 4K content to name a few), Nizar Romdan was joined on stage by Peter Parmenter of EA, Chris Doran of Geomerics, Sameer Baroova of PlayJam, Marcus Kruger of Goo Technologies and Will Freeman of Develop Magazine for a half-hour discussion on "The Future of Mobile Gaming Technologies". The most interesting point I picked up from this discussion was the potential future market of HTML5 games on SmartTVs - as SmartTVs become more and more prevalent (yet predictably fragmented), the browser shines out as the most stable and dependable platform for gaming across all of the solutions on the market. Yet mobiles were confirmed to be the primary gaming device of the future - preferable above console, above PC, above SmartTVs. Current console gamehouses such as EA and Activision have seen this and are responding rapidly, developing their own games for mobile. Facebook mentioned the day before that the company had already made the mental switch to prioritizing mobile over PC. The entire gaming industry is moving to mobile and, as technology advances, it will only become more firmly established as a standard part of our daily lives.

 

At the end of the day, Will Freeman's quote summed up the mood of Mobile Games Forum 2014 perfectly for me: "There's a few challenges, but a hell of a lot of opportunities". Good luck to all the game developers out there!

 

 

Nizar Romdan presents the gaming possibilities available on ARM hardware

 


Meanwhile, Phill Smith demonstrates the graphics

capabilities of ARM® Mali GPUs

ARM, EA, Goo Technologies, Geomerics and PlayJam

muse on the future of mobile gaming

How to fill the object shaped holes in your scene


Ordinarily the blog posts I write relate directly to whatever I have been working on recently. This one was inspired by, or rather specifically requested by, my friend and colleague in Developer Relations, Chris Varnsverry. He called me over the other day and said “You know this thing people keep doing?”

I replied that yes, I did know, I was one of the first to spot the problem.

To which he responded “Can you teach them not to?”

 

Interlude:

Did you know that male sparrows have markings on their chests to indicate their ranking in the literal pecking order of sparrow society? They look like little downward pointing triangles and they can only be seen with a special camera receptive to ultraviolet light.

Similarly, a standard video camera can be easily confused by shining a bright infrared light at it. Invisible to the human eye, infrared will still activate the CCD sensor array and make the picture come out with nothing but a bright white glare.

These two scenarios are real-life situations which show just how little we really pay attention to things that fall outside of our visual spectrum, until some kind of technology is introduced which makes them visible.

 

alphaissues.png

Kind of like if you have an application showing graphics perfectly normally on a tablet, and then you plug it into an HDMI port and portions of the image on the TV screen are just black. Sometimes whole objects, sometimes rectangular portions, sometimes curious nonsensical shapes formed by the overlapping of scene elements. Since we first spotted it we’ve also seen the problem arise on certain display driver hardware for the device screen, now that we know what to look for.

 

Developer Relations is often contacted by developers and hardware manufacturers with images that look like these, thinking that there is something wrong with either the chip or device. In fact it is neither. You may never have encountered this problem - I had been working in graphics for years before I saw it - and it can arise from the strangest of circumstances. If you code a lot of graphical applications I’d recommend you continue reading this blog post, because one day you may just suffer this problem, and the solution is simple – if you know what the cause is. Using a tool like the ARM® Mali™ Graphics Debugger you can take a frame buffer direct from the GPU driver and look at the things going on in the alpha channel using image editing software. This is how we first spotted the problem.

 

People often think of alpha channels as a useful part of a texture and ignore them at all other times but the actual display frame buffer can have an alpha channel too. Most of the time you never see it because the display driver simply ignores it, but it is still there and not EVERY display driver ignores it.

So if you ever find yourself seeing parts of your image inexplicably blacked out or darkened, you should take a look at the following areas in your code:

 

1. Blend Modes

When doing alpha blending you probably use the line:

 

glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);

 

Which is fair enough, except that it can have an interesting effect on the resultant alpha in the frame buffer. If your source is semi-transparent, say 1.0, 1.0, 1.0, 0.5 (RGBA) and your destination is RGBA 0.0, 0.0, 0.0, 1.0, the resultant blend will be 0.5, 0.5, 0.5, 0.75.

That 0.75 will be your undoing.

 

Most of the time, unless you’re reading back from the tile to do clever post processing or including GL_DST_ALPHA in your blend function, you don’t want the alpha channel of the frame buffer to ever be anything but a fully opaque 1.0.

Therefore you’d be better served by the line:

 

glBlendFuncSeparate(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA, GL_ZERO, GL_ONE);


Which sets up standard alpha blending on RGB but maintains the destination alpha, which should be cleared to 1.0.
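Putting the two points together, a minimal sketch of that set-up (the clear colour values are illustrative) looks like this:

/* Clear the destination alpha to fully opaque, then blend RGB normally while
 * leaving the destination alpha untouched. */
glClearColor(0.0f, 0.0f, 0.0f, 1.0f);
glClear(GL_COLOR_BUFFER_BIT);

glEnable(GL_BLEND);
glBlendFuncSeparate(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA,   /* RGB as before */
                    GL_ZERO, GL_ONE);                       /* keep dst alpha at 1.0 */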


2. Shader Fixes

So when you’re done alpha blending, or if you never did it, chances are your blend mode has been set to:

 

glBlendFunc(GL_ONE, GL_ZERO);


So whatever you output from your shader is going directly onto the frame buffer. At this point, anything you output to your framebuffer’s alpha channel other than 1.0 will pollute it with unwanted transparency. Some people write shaders which calculate the RGB values perfectly and then pad them out into an RGBA value thusly:

 

gl_FragColor = vec4(rgbCol,1.0);

 

This is totally fine, so long as the alpha at the end is 1.0.

 

If, however, you’ve accidentally written:

 

gl_FragColor = vec4(rgbCol,0.0);

 

On the majority of display drivers this will look no different, until you hook your tablet up to a monitor and the HDMI encoder decides to use the alpha channel for its own nefarious purposes (usually just blending alpha to black).

 

3. The Thermonuclear Option

There comes a time in the development life cycle of every project where the code is, frankly, getting a bit big. This is usually near the end, and may be around the point where you first start testing platform compatibility to discover that, connected to HDMI, some devices black out bits of the scene. If you’ve been paying attention, you will probably know by now that this is an alpha problem, but if your code base is particularly massive you’ll want to check that alpha really is to blame before picking your code apart.

 

What I’m about to suggest will work but may have a minor performance impact, as you’re essentially fixing the alpha in post processing. The fix is this:

 

glClearColor(0.0f, 0.0f, 0.0f, 1.0f);

glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_TRUE);

glClear(GL_COLOR_BUFFER_BIT);

 

Adding these three lines at the end of the frame will leave the RGB of the scene intact but completely replace the alpha channel with full opacity. This will immediately fix all alpha problems.  Obviously you’ll need to turn the mask and clear colours back again before you draw the next frame. If this works but you don’t want that extra draw step (it’s not a massive hit, but it’s not completely free in performance terms), get rid of this step and fix the alpha values properly at their source.
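For completeness, restoring the state before the next frame is just as short (your application's usual clear colour may of course differ):

/* Re-enable writes to all channels and restore the normal clear colour
 * before starting the next frame. */
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glClearColor(0.0f, 0.0f, 0.0f, 1.0f);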

 

 

Finally, just in case I don't get to post again before it, I look forward to seeing you at GDC for more graphics techniques and tips.

The Mali GPU: An Abstract Machine, Part 1


Optimization of graphics workloads is often essential to many modern mobile applications, as almost all rendering is now handled directly or indirectly by an OpenGL ES based rendering back-end. One of my colleagues, Michael McGeagh, recently posted a work guide [http://community.arm.com/docs/DOC-8055] on getting the ARM® DS-5™ Streamline™ profiling tools working with the Google Nexus 10 for the purposes of profiling and optimizing graphical applications using the Mali™-T604 GPU. Streamline is a powerful tool giving high resolution visibility of the entire system’s behavior, but it requires the engineer driving it to interpret the data, identify the problem area, and subsequently propose a fix.

 

For developers who are new to graphics optimization it is fair to say that there is a little bit of a learning curve when first starting out, so this new series of blogs is all about giving content developers the essential knowledge they need to successfully optimize for Mali GPUs. Over the course of the series, I will explore the fundamental macro-scale architectural structures and behaviors developers have to worry about, how this translates into possible problems which can be triggered by content, and finally how to spot them in Streamline.

 

Abstract Rendering Machine

 

The most essential piece of knowledge which is needed to successfully analyze the graphics performance of an application is a mental model of how the system beneath the OpenGL ES API functions, enabling an engineer to reason about the behavior they observe.

 

To avoid swamping developers in implementation details of the driver software and hardware subsystem, which they have no control over and which is therefore of limited value, it is useful to define a simplified abstract machine which can be used as the basis for explanations of the behaviors observed. There are three useful parts to this machine, and they are mostly orthogonal, so I will cover each in turn over the first few blogs in this series; just so you know what to look forward to, the three parts of the model are:

 

  • The CPU-GPU rendering pipeline
  • Tile-based rendering
  • Shader core architecture

 

In this blog we will look at the first of these, the CPU-GPU rendering pipeline.

 

Synchronous API, Asynchronous Execution

 

The most fundamental piece of knowledge which is important to understand is the temporal relationship between the application’s function calls at the OpenGL ES API and the execution of the rendering operations those API calls require. The OpenGL ES API is specified as a synchronous API from the application perspective. The application makes a series of function calls to set up the state needed by its next drawing task, and then calls a glDraw[1] function — commonly called a draw call — to trigger the actual drawing operation. As the API is synchronous all subsequent API behavior after the draw call has been made is specified to behave as if that rendering operation has already happened, but on nearly all hardware-accelerated OpenGL ES implementations this is an elaborate illusion maintained by the driver stack.
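You can observe this illusion for yourself by timing a draw call against an explicit synchronization point. The sketch below is purely illustrative: it assumes a current OpenGL ES context with buffers already bound, and it deliberately calls glFinish (something you should never do in production code) just to expose the asynchronous execution.

#include <stdio.h>
#include <time.h>
#include <GLES2/gl2.h>

/* Helper: monotonic time in milliseconds. */
static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
}

/* glDrawElements() returns almost immediately because the driver only queues
 * the work; glFinish() blocks until the GPU has actually finished, exposing
 * the asynchronous execution hiding behind the synchronous API. */
void time_one_draw(GLsizei index_count)
{
    double t0 = now_ms();
    glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_SHORT, 0);
    double t_queued = now_ms() - t0;   /* usually a tiny fraction of a frame */

    glFinish();                        /* force all queued work to complete */
    double t_done = now_ms() - t0;     /* now includes the real GPU time */

    printf("queued: %.2f ms, completed: %.2f ms\n", t_queued, t_done);
}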

 

In a similar fashion to the draw calls, the second illusion that is maintained by the driver is the end-of-frame buffer flip. Most developers first writing an OpenGL ES application will tell you that calling eglSwapBuffers swaps the front and back-buffer for their application. While this is logically true, the driver again maintains the illusion of synchronicity; on nearly all platforms the physical buffer swap may happen a long time later.

 

Pipelining

 

The reason for needing to create this illusion at all is, as you might expect, performance. If we forced the rendering operations to actually happen synchronously you would end up with the GPU idle when the CPU was busy creating the state for the next draw operation, and the CPU idle while the GPU was rendering. For a performance critical accelerator all of this idle time is obviously not an acceptable state of affairs.

gles-sync.png

 

To remove this idle time we use the OpenGL ES driver to maintain the illusion of synchronous rendering behavior, while actually processing rendering and frame swaps asynchronously under the hood. By running asynchronously we can build a small backlog of work, allowing a pipeline to be created where the GPU is processing older workloads from one end of the pipeline, while the CPU is busy pushing new work into the other. The advantage of this approach is that, provided we keep the pipeline full, there is always work available to run on the GPU giving the best performance.

 

gles-async.png

 

The units of work in the Mali GPU pipeline are scheduled on a per render-target basis, where a render target may be a window surface or an off-screen render buffer. A single render target is processed in a two step process. First, the GPU processes the vertex shading[2] for all draw calls in the render target, and second, the fragment shading[3] for the entire render target is processed. The logical rendering pipeline for Mali is therefore a three-stage pipeline of: CPU processing, geometry processing, and fragment processing stages.

 

gles-mali.png

 

Pipeline Throttling

 

An observant reader may have noticed that the fragment work in the figure above is the slowest of the three operations, lagging further and further behind the CPU and geometry processing stages. This situation is not uncommon; most content will have far more fragments to shade than vertices, so fragment shading is usually the dominant processing operation.

 

In reality it is desirable to minimize the amount of latency from the CPU work completing to the frame being rendered – nothing is more frustrating to an end user than interacting with a touch screen device where their touch event input and the data on-screen are out of sync by a few hundred milliseconds – so we don’t want the backlog of work waiting for the fragment processing stage to grow too large. In short we need some mechanism to slow down the CPU thread periodically, stopping it queuing up work when the pipeline is already full enough to keep the performance up.

 

This throttling mechanism is normally provided by the host windowing system, rather than by the graphics driver itself. On Android for example we cannot process any draw operations in a frame until we know the buffer orientation, because the user may have rotated their device, changing the frame size. Surface Flinger — the Android window surface manager – can control the pipeline depth simply by refusing to return a buffer to an application’s graphics stack if it already has more than N buffers queued for rendering.

 

If this situation occurs you would expect to see the CPU going idle once per frame as soon as “N” is reached, blocking inside an EGL or OpenGL ES API function until the display consumes a pending buffer, freeing up one for new rendering operations.
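A simple way to see this from the application side is to time eglSwapBuffers itself; the sketch below assumes a current EGL display and surface, and the 2 ms reporting threshold is an arbitrary choice.

#include <stdio.h>
#include <time.h>
#include <EGL/egl.h>

static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
}

/* If the window system is throttling the pipeline, eglSwapBuffers() will
 * block here until a buffer becomes free; long waits on most frames mean the
 * CPU is running well ahead of fragment processing, or ahead of vsync. */
void swap_and_report(EGLDisplay dpy, EGLSurface surface)
{
    double t0 = now_ms();
    eglSwapBuffers(dpy, surface);
    double blocked = now_ms() - t0;
    if (blocked > 2.0)
        printf("eglSwapBuffers blocked for %.2f ms\n", blocked);
}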

 

gles-mali-throttle.png

 

This same scheme also limits the pipeline buffering if the graphics stack is running faster than the display refresh rate; if the GPU is producing frames faster than the display can show them then Surface Flinger will accumulate a number of buffers which have completed rendering but which still need showing on the screen; even though these buffers are no longer part of the Mali pipeline, they count towards the N frame limit for the application process.

gles-mali-vsync.png

As you can see in the pipeline diagram above, if content is vsync limited it is common to have periods where both the CPU and GPU are totally idle. Platform dynamic voltage and frequency scaling (DVFS) will typically try to reduce the current operating frequency in these scenarios, allowing reduced voltage and energy consumption, but as DVFS frequency choices are often relatively coarse some amount of idle time is to be expected.

 

Summary

 

In this blog we have looked at the synchronous illusion provided by the OpenGL ES API, and the reasons for actually running an asynchronous rendering pipeline beneath the API. Tune in next time, and I’ll continue to develop the abstract machine further, looking at the Mali GPU’s tile-based rendering approach, and the shader core architecture it enables.

 

Comments and questions welcomed,

Pete

 

Footnotes

 


Is the future as good as it used to be?


Today the ARM® Mali™-DP500 Display Processor was launched alongside the ARM Cortex®-A17.

 

We are all aware that the resolution of screens is increasing: from Full HD smartphones and Quad HD tablets to Ultra HD TVs. However, simply increasing the number of pixels on the display does not result in a proportionately improved viewing experience if, at the same time, we do not also upgrade the capabilities of the hardware driving the screen.

 

Mali-DP500 is a high specification Display Processor supporting multi-layer composition, scaling, colour conversion and control for a wide range of displays up to 4K - all packed into a very small silicon area.

 

So, why did ARM choose to invest in this new product area?

From talking to our partners we have seen an increasing challenge in this space.

 

Imagine the scenario. You’ve just got samples of your SoC back. The pressure is on to deliver that first demo at CES and you have an impressive list of target customers. Your software team have been looking forward to this day for months – the day they bring the SoC to life. However, early on in the process you run into a number of challenges:

 

  • Your video processor does not have its own MMU and on top of that your engineers do not know the ideal memory alignment for each of the components. Negotiating the memory allocation to get an acceptable level of efficiency is taking a considerable amount of time.
  • Next, your composition engine doesn’t support Android™ Pre-multiplication or Fences and this is causing major system bottlenecks. Once again, you have to put your smartest software guys on solving this.
  • Then your lead engineer tells you he doesn’t know “which way is up”. OpenGL® ES origin is bottom left of the screen. Display origin is top left. He’s right, which way is up?
  • When you finally get a passable Android ported to this platform, a new version of Android hits the market which requires some major software changes. You need your best software guys again.

 

That demo at CES is definitely at risk: delivery dates are slipping and your customers are starting to look to your competition.  And fundamentally, you have had no time or engineering bandwidth to create a differentiated product.

 

Sound familiar?

 

I’m reminded of a song by an obscure 1990s Baltimore band The Tinklers called “The future is not as good as it used to be”. They bemoan the fact that their childhood vision of a future world - where we all speak Esperanto, eat pills with 16 inch pizzas inside and only work 3 days a week because robots do everything for us – just didn’t turn out that way.

 

The ARM Mali-DP500 is a strong addition to our multimedia processing family.  It facilitates the use of important, bandwidth-saving technologies like AFBC across all major processing elements and simplifies the deployment of end-to-end security for our partners.

 

However, when you combine this with our software support it makes a game-changing solution for customers. ARM is now able to provide a complete Android multimedia software stack – drivers optimized for ARM CPU, GPU, Video and Display. This software stack will reduce the time that partners spend addressing challenges that are common across the industry - like pre-multiplication, Android Fences and the latest Android releases. Having a complete set of multimedia processing units and associated software which all work together efficiently straight out of the box means our partners can spend more time on differentiating their products. davidsben's accompanying blog Do Androids have nightmares of botched system integrations? expands on this and really is a must-read.

Integrated Android Stack.png

Here’s another scenario. Imagine you are entering digits into your phone to transfer some money; in a poorly secured device a rogue application can take over the screen and “trick” you into typing more 0’s simply by displaying only every other zero. You end up transferring $10,000 instead of $100. It’s that song again!

 

Another distinctive feature of the ARM Mali-DP500 is the built-in support for ARM TrustZone® technology. TrustZone enables secure transactions on your mobile device in situations such as payment. The ARM Mali-DP500 can be programmed from a Trusted Execution Environment (TEE) so that a secure screen which cannot be overwritten by a rogue software application is always on the top of the display. This gives a user confidence that what they see on screen is what they expect and prevents software attacks from compromising your phone or tablet.

TrustZone.png

When ARM is your partner, we provide all the advanced and power-efficient multimedia hardware and software you need to build upon, freeing you up to innovate and create novel products that change the world. So, is the future as good as it used to be [1]? No, of course not – it’s better!

 

- Chris

 

[1] Ironically, the only part of the song to come true is the line “picture telephones that let you look at to whom you are talking”

Do Androids have nightmares of botched system integrations?


Today ARM launched the ARM® Mali™-DP500 Display Processor, the first high specification Display Processor that ARM has developed. Alongside the Display Processor itself, ARM will also be delivering an optimized software driver for Android™’s HW Composer HAL, providing a buttery smooth user experience for your Mali-DP500-based Android devices.

 

Adding the Mali-DP500 into ARM’s media processing portfolio, as explained below, makes a giant leap forward in the system integration story of the platforms you build with ARM’s products; but for ARM’s software driver team, it’s just another step along a path we've been following since Android Éclair (2.1) first sparked to life on an ARM Mali-200 development platform in 2009.

 

Since that day, almost five years ago, ARM’s software driver team have added support for eight different versions of Android (from Éclair to KitKat) into thirteen releases of our driver stacks running on top of seven generations of Mali GPU hardware and they have also helped deploy those drivers into literally hundreds of different devices, including working on the bleeding edge of Android for Google’s Nexus 10. This has been no small accomplishment and in the process of supporting Android for so long our software team has built up a huge amount of experience in the OS, not just in the GPU sub-system, but in the entire media sub-system, covering rendering, composition, video and camera, within which the GPU plays only one (albeit large and critical) part.

 

We’ve seen the struggles that customers can have trying to integrate a video from one vendor, camera ISP from another, GPU from ARM and in-house display solution together into Android and have worked with and guided many of them towards solutions that transform their systems into first-rate, truly beautiful Android experiences. Through these different customer experiences we've developed a clear understanding of what a fully integrated set of media sub-system drivers must look like and what features they must support in order to achieve the best performance and power results that any one platform is capable of on Android.

 

So why is system integration so difficult?

 

In this case I'm hoping that a picture is worth a thousand words or, for this picture in particular, a thousand or so man-hours of your engineers’ time spent on something else instead of system integration:

 

android_stack.png

 

The diagram above is a simplified (yep, in reality it’s worse than this) view of the interactions between Android’s common user-space components and the underlying software drivers, kernel components and hardware that is used to provide the user experience on Android. If you take each of your media components from different vendors then what you end up with is three (or more) software drivers that you first need to integrate into your platform separately, after which you also must integrate them with each other in order to get decent system performance. If you get the integration wrong, or if the different components don’t talk using the same standard interfaces, then what you’re left with is a functional platform that runs too slow, or burns too much power, or in the worst case somehow manages to do both at the same time.

 

ARM has taken each of the common integration headaches it has seen happen time and time again on customer platforms and designed them away by producing a collection of drivers designed to integrate and perform together. Let’s have a look at the real issues we see customers facing and how our pre-integrated solution avoids them:

 

Pixel format negotiation (“My system components don’t talk the same language!”)

 

One of the key concerns during system integration is making sure each component in the media sub-system (be it, Video, GPU, Camera or Display) is actually capable of understanding the format of the graphical output from the other components it reads from as well as ensuring each is capable of generating content in a format other components can read:

  • Your video hardware may be capable of writing out video frames in five different YUV formats but if none of them are supported by your Display Processor then you have no choice other than to burn some GPU power to compose that video onto the display.
  • What if you’ve accidentally implemented a display processor that doesn’t understand pixel formats that have pre-multiplied alpha values (as used by most of Android’s user interface)? Suddenly your super clever display processor is nothing more than a glorified framebuffer controller, scanning out frames your GPU has had to generate for you.
  • What if your components are all able to understand 32-bit RGBA pixel formats perfectly but for some reason some of your apps are displaying with inverted colors? Now you’re wasting days of engineering time tracking down which component disagrees with everything else about the ordering of the Red and Blue components of 32-bit pixel formats as well as figuring out how to make it flip them the other way.

These are just some examples of real integration issues we’ve seen happen, all of which are avoided by using our complete solution: each ARM component has been designed to work with each other as well as work within Android, so you won’t get any last minute confusion over whether they all speak the same language. What’s more, ARM provides, alongside its software drivers, an open source allocation library (Gralloc) that is already set up with support for each of our components, ensuring bring-up time is reduced even further.

 

Memory allocation negotiation (“My system components don’t talk to each other!”)

 

Another area which causes many issues is deciding where and how to allocate memory for the system’s graphic buffers. When you allocate memory you need to take into account the various constraints of the underlying hardware that is to access that memory. Some key questions that you have to be able to answer when integrating the components together are:

  • Do all my components have an sMMU? If not then for certain allocations you’ll be forced to allocate some memory as physically contiguous to ensure it can be read by all components.
  • What’s the ideal memory alignment for all of the targeted components? Without this knowledge for every component in the system you could end up making very inefficient memory accesses when processing the graphic buffers.
  • Is there a certain area of memory that some components cannot access? Or an area that they must access from?

 

The Gralloc library provided by ARM has built-in understanding for all the system constraints of ARM’s multimedia processors and can work together with the Android Kernel’s ION allocator to ensure the most appropriate and memory efficient allocations are made for each processor in the system.

 

In addition, each of the software drivers for ARM’s multimedia processors utilizes the standard Linux dma_buf memory sharing features. By ensuring that all of the drivers use the same interface, the same allocation can be written to by one processor and read from by another providing a “zero copy” path for all graphical and video content on the platform, ensuring that the memory bandwidth overhead remains as low as possible.

 

Synchronization (“My system components talk over each other!”)

 

When you have a “zero copy” path in your system, and two or more devices are using the same piece of memory directly, synchronization between those components becomes extremely important. You don’t want your display processor to start reading in a buffer before the GPU or video processor has finished writing to it, or you’ll end up with some very nasty screen corruption.

 

In older versions of Android (before Jellybean MR1), synchronization was handled and controlled in the Android user space by the way of each component in the rendering pipeline performing the following steps: processing its commands in the software driver, performing its task in the HW, waiting for that task to complete in the SW driver and then passing responsibility onto the next stage of the pipeline. This allowed for a very simple and easy synchronization method between components but also caused large bubbles (gaps) in the rendering pipeline as you’d continually stall and ping-pong work between the CPU and the HW and you wouldn’t start the CPU processing of the next stage until the HW processing of the previous stage had completed. All of these pipeline stalls could mean the difference between a “buttery smooth” and a “stuttering along” end user experience.

 

With Jelly Bean MR1, a new synchronization method, Android fences, was added to the Android platform. These fences, provided the software driver supports them, allow each stage in the pipeline to do its CPU-side processing and queue work for its component even if the previous stage hasn’t finished in the hardware, and to pass control to the next stage of the pipeline before its own hardware processing has completed (or even begun). As each component in the pipeline completes its work, it signals a fence and the next stage is automatically triggered with as little CPU involvement as possible. This allows much smaller gaps between one piece of hardware completing and the next one in the chain starting, squeezing out every last bit of performance possible from your system. In order to make full use of the benefits of Android fences, every component in the rendering pipeline needs to support them. If one of your components does not, then that stage in the pipeline falls back to waiting in user space and a performance bubble is introduced into your system.
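To give a feel for the mechanism, here is a minimal sketch using Android’s libsync helpers. The fence file descriptors are assumed to arrive alongside a buffer (as they do from SurfaceFlinger); the structure around them is illustrative only.

```c
#include <sync/sync.h>   /* Android libsync helpers for fence fds */
#include <unistd.h>

/* Pessimistic path (a component without fence support): block on the CPU
 * until the producer's hardware has finished writing, then consume the buffer. */
void consume_without_fences(int acquire_fence)
{
    sync_wait(acquire_fence, -1);   /* -1 = wait forever */
    close(acquire_fence);
    /* ... only now is it safe to kick our own hardware on this buffer ... */
}

/* Fence-aware path: no CPU wait at all. Merge the producer's fence with any
 * fence our own hardware is already waiting on, and hand the merged fence
 * straight down so the hardware starts as soon as both have signalled. */
int consume_with_fences(int acquire_fence, int our_pending_fence)
{
    int merged = sync_merge("pipeline-merge", acquire_fence, our_pending_fence);
    close(acquire_fence);
    close(our_pending_fence);
    return merged;
}
```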

 

The big problem, however, comes when all of your components support Android fences but one of them has a bug. The only way that these bugs will manifest is as a sudden graphical glitch in the system that almost instantly disappears. What do you do now? You’ve got three or more different vendors providing software drivers that all support Android fences and one of them has a bug, but how do you know which one? How do you track it down? Before you know it you've had to kick off three separate investigations with your vendors to try and find a bug that only manifests when one vendor’s component uses a standard interface to communicate with another vendor’s component. These kinds of bugs can be extremely difficult to find, especially when no single vendor will know anything about the other vendors’ software. They’re the kind of bugs that stretch your device's release date further and further out as you wait for someone to have a eureka moment. Luckily, this isn't your only option; if you've taken the complete solution from ARM then you already have a set of drivers that have been implemented and even validated together to run correctly, and if you did find an issue there’s only one vendor you need to talk to, and you can be confident that they already have all the expertise needed to find it and fix it quickly.

 

Efficient composition

 

System integration isn't all about avoiding problems, it’s also about bringing a number of components together to achieve more than they ever could do alone. ARM’s new Mali-DP500 is a product tuned precisely for getting the most out of Android by efficiently offloading composition work from the GPU where it counts. We've performed detailed investigations into the common composition scenes generated by Android and seen that most applications and games found on the Android market make use of three or fewer layers with four or five layers usually being the upper limit produced.

 

image008.png image009.png

 

Any composition engine must trade off the number of hardware layers it supports against silicon area/cost. When composing a frame with more layers than are supported in hardware, any additional layers are typically handled by flattening them together using the GPU. The Mali-DP500 software drivers handle this by having complete control over which layers are sent to the GPU to be flattened, which allows us to leverage our expert knowledge of how to get the best performance out of our GPUs. The Mali-DP500 software driver will make intelligent decisions, based on the scenes being generated by Android, about what to send to the GPU in order to use as little bandwidth and power as possible compared to doing the full composition on the GPU.

 

When technologies such as Transaction Elimination are deployed in the system’s GPU the Mali-DP500 software driver can ensure that the GPU is processing only static or infrequently changing layers, effectively reducing the amount of memory bandwidth used by the GPU to write those layers to memory to near zero; and when coupled with AFBC technology in the GPU the memory bandwidth used, even in cases where the GPU must actually process non-static content, is greatly reduced.

 

Conclusion

 

So there you have it. With the addition of optimized drivers for both display and video, which sit alongside what we’ve already been providing for the GPU, ARM is now able to offer a complete, off-the-shelf set of drivers that come pre-integrated and optimized together and, most importantly, have all been validated to ARM’s highest quality standards to work seamlessly together *before* your platform is even ready to run software for the first time.

 

System integration on Android no longer has to be a trade-off between stability and performance, or become a cross-vendor organizational nightmare. Androids can peacefully dream about electric sheep once more.

ARM Mali Display - the joined-up story continues


For some time now I’ve been introducing myself in meetings as being responsible for future technology in graphics, video and display and finally I have something to talk about publicly. As my colleague Chris Porthouse says in his blog Is the future as good as it used to be?, ARM has just proudly launched the ARM® Mali™-DP500 Display Processor, available for licensing now, after a multi-year development.

 

History

Stephensons_Rocket_drawing.jpg

In the bad old days, when pretty much all you had to do was read in the frame buffer from DRAM and push it out in pixel order, the job of designing a “display controller” was usually given to a junior SoC engineer to allow him to cut his teeth. The bandwidth to DRAM was OK, screen resolutions and frame rates were low, the contention on the bus was manageable, security wasn’t a concern and image processing was limited to trying to get the colours of the pixels correct. Inevitably a quick respin of the FPGA was needed when someone’s interpretation of RGB turned out to be BGR when laid out in memory, but no-one was usually fired. The software was often no more complex than a short driver that programmed a few registers, and high-level operating systems with complex GUIs were mostly unknown or irrelevant to the world of battery-powered devices.

 

 

 

 

Modern times

Kyusyushinkansen_type800_shinminamata.jpg

These days, driving a 1080p screen at 60 frames per second means you are probably reading an absolute minimum of 250 Mbyte/s from DRAM to display, and it could easily be a lot more. There are many other masters on the bus in a modern SoC and you can’t just rely on being able to get the bandwidth and bus priority you might want. What was originally a simple IP block has now become a relatively high-bandwidth bus master that needs to play nicely with the other masters. What was a simple controller has become a processor in its own right. On a modern smartphone, multiple image layers from a still camera, decoded digital video, 3D graphics and other sources will need to be programmed to blend and composite together to produce the final screen image, with pixel formats converted on the fly. Even phones might have multiple screens, supporting a second over HDMI or Miracast or another WiFi-based display. Those screens might display completely different images, or they might display the same image, but differently scaled. In addition, tablets and other devices are scaling to 4K screens, users are expecting the screen image to rotate when they rotate their devices, and secure display is a concern for many when dealing with financially sensitive data. Add all that to the OS integration work required that Chris talked about, and are you still happy to have the new graduate do it all? No, we didn’t think so, and that’s one of the reasons why we produced the ARM Mali-DP500 product – designing and validating a display processor (with all the associated software work) that meets modern requirements is a significant piece of work and it makes economic and schedule sense to buy in the IP rather than do all that work in-house.

 

The other major reason for ARM deciding to produce a display processor product is technical. When you have multiple pieces of media processing IP that are designed to work together, we can take advantage of that to improve efficiency and save power. Just like ASTC (blogged about extensively), which is a texture compression method used in GPUs to save memory bandwidth, we also have ARM Frame Buffer Compression (AFBC) used across multiple pieces of IP. AFBC is a lossless compression method invented by ARM, also used to compress/decompress images to save memory bandwidth (and thus SoC power), but without losing any detail or quality whatsoever. When we launched the ARM Mali-V500 video processor, Ola Hugosson blogged about how we use AFBC in Mali-V500 internally - for reference frames, reducing bandwidth, as well as using it to produce the final image compressed. In Ola’s blog there’s a graph of how much memory bandwidth can be saved. To gain maximum advantage, you need a display processor that can read the compressed images and decompress them, in addition to providing all the other features described above. In the case of using an AFBC-enabled display processor, you can save hundreds of Mbyte/s decoding a 4K stream, and even with modern memory systems that adds up to a significant power saving. The graph is repeated here and the difference between the green line and the red line is the use of an AFBC-enabled display processor.


jems blog picture 3.jpg


Partners want all IP blocks to use a common, lossless compression format so that data can be interchanged seamlessly between them in the most power- and memory-efficient way, so obviously the new Mali-T760 GPU also supports AFBC. We have joined-up product families of CPUs, GPUs, video processors, and now display processors that can utilise common technologies across the system to save power. TrustZone is another technology we use across the processor families to create security and content protection solutions and we have other technologies being worked on at the moment which will increase the advantages our partners gain if they take multiple IP blocks from us, but I have gone on long enough for now…

 

I’ve blogged before about the way in which we have joined-up technical strategies across our product lines. Funnily enough it’s something I spend a lot of my days doing. For example, at the time of its launch, many people asked why the Mali-T600 family of GPUs was able to use more than 32-bits-worth of memory. Then ARM produced CPUs that can access it as well and they started to get it. Now of course 64-bit addressing has suddenly become the new black and we’re able to demonstrate systems with 64-bit addressing being used across the system. We’re in a much more advanced position on this than many of our competitors. Joining ARM’s IP together is much, much more than simply defining AMBA bus interconnect standards (great though that is). In this blog, I hope I’ve given you a flavour of how we work on driving advantages and optimisations from that joining-up, why the ARM Mali-DP500 will be a fantastic component of our partners’ systems and why that will help our partners make better products.

The past, present and future of mobile photography


Image1.png

Last week I had the great opportunity to deliver a talk at the Electronic Imaging Conference in San Francisco. The conference track was introduced by Michael Frank of LG Electronics, who gave a very interesting historical backdrop to parallel computing technologies and the importance of heterogeneous computing. Philip Rogers and Harris Gasparakis of AMD followed with a fine overview of the rationale and merits of the Heterogeneous System Architecture and its implications for image processing and computer vision.

Soon after that it was my turn. I could have talked for days, but as I only had a 45-minute slot I opted to focus my talk on mobile computing and the taking and processing of photo images.

 

 

 

(Pre) history of the camera phone and the rise of smartphone photography

 

Many attribute the birth of the camera phone to Philippe Kahn, a technology pioneer who broadcast the birth of his daughter in June 1997 through a small camera integrated into a Motorola phone. The first commercial phones with cameras started appearing in the year 2000, with resolutions as low as 0.11 MPix and limited storage capability (only tens of photos). Initially, not all devices supported browsing and wireless electronic sharing; in some cases the user had to plug the phone into a PC via a cable and download its pictures.

By 2002, resolutions had grown to 0.3 megapixels, the camera phone had reached U.S. shores, and features such as white balance control, self-timer, digital zoom, and various filter effects were making their way into handsets.

Then the megapixel race started.  In 2004, the one megapixel barrier was smashed, making photo quality sufficient for small prints - and some handsets even allowed you to take up to 8 photos in a burst shot!

Resolutions quickly ran up to two, three and five megapixels and features such as LED flash, professional brand optics, image stabilization, autofocus and in some cases even futuristic rotating screens started to appear.

You can find more details in this great article.

 

Today, smartphone photography is gradually becoming the most popular form of photo taking, with many people using their phone as their primary camera. By 2011 over a quarter of photographs were taken on a mobile phone (NPD Group, 2012) and 18% of consumers considered their phone to be their primary camera (Consumer Electronics Association, 2012). A more recent study has seen this trend consolidate, with 58% of North American smartphone photo-taking consumers exclusively taking photos with their phone in 2013. This, in turn, stimulates photo sharing. For example, when Instagram launched, over 100k people signed up in the first week, and in 2013 over 150M active users were registered.

Developers are starting to take advantage of these trends. A quick search for camera apps in the Google Play™ store returned several hundred applications:

 

Image2.png

 

The Spring of Computational Photography

 

Computational photography refers to computational image capture, processing, and manipulation techniques that enhance or extend the capabilities of digital photography.

The computational photography features of a modern smartphone are impressive:

  • Multi image processing:  picture-in-picture dual shot, “beautification”, “best face” photo composition, HDR, 3D panorama.
  • Real time processing: stabilization, de-noising, face and features detection, motion capture.
  • Post production: advanced image filters, occlusion removal, 2D to 3D conversion.

 

However, for all the great technology you put in the sensor, etc., mobile devices are still limited by size, form factor and cost considerations. The sensor and lens on a phone are never going to match a full-size SLR. Then on top of that you have to compensate for outside factors such as challenging lighting conditions and the skill (or lack thereof) of the photographer.

 

Image3.png
Some truly innovative software solutions have recently emerged to resolve these challenges, and I focused on these for the rest of my talk.

 

Image sensors and the role of the GPU

 

Over 1 billion higher quality camera phone modules (at least 3 megapixel resolution) will ship in 2014 and over half of these modules will have at least 8 megapixel sensors. The camera phone module market was $10.8 billion in 2012, and is expected to have grown to $13.3 billion in 2013, or 24% per year (IC Insights, 2013; Research in China, Global and China CMOS Camera Module Industry Report, 2012-2013).

One of our many partners in this market, Aptina Imaging, has taken this technology even further. They prototyped a platform consisting of their latest interlaced HDR sensor set up to feed raw data to the Samsung Exynos 5250 SoC on an Insignal Arndale Board. They then offloaded the whole image post-processing pipeline to the ARM Mali-T604 GPU using OpenCL™, including noise reduction, iHDR reconstruction, tone mapping, color conversion, gamma correction and de-noising.

Why is this approach interesting? A software solution can offer more flexibility and enables algorithm modifications right up to the release of the consumer device. Sensor and camera module vendors can invest in optimized portable software libraries instead of hardware. One camera module and sensor configuration can be used to address a broader range of market requirements. At the same time SoC implementers can reduce device costs by offloading selected ISP blocks to the CPU + GPU subsystem, and optimize/balance interoperation between traditional ISP processing and the GPU.

 

Improving interoperations

 

A critical aspect of heterogeneous computing is the communication of data between the various processors. OpenCL implements a host-device architecture, where your application prepares data buffers then hands them over to the device (worker) which, when it has finished, copies the data back. OpenCL has its roots in traditional GPGPU computing, which expects discrete VRAM. Even if copies are not physically required (i.e. the same physical RAM, as in an ARM SoC), the programming abstraction and API still envisage copies (which may well be redundant). We have optimized our drivers to remove these overheads by avoiding redundant copies.
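One way an application can help the driver here is to let it own the allocation and then map that allocation for host access, rather than forcing explicit read-backs. The sketch below shows the general pattern using standard OpenCL calls; the kernel, queue and buffer size are placeholders.

```c
#include <CL/cl.h>

/* Sketch: on a unified-memory SoC, let the OpenCL driver own the allocation
 * and map it into the host address space instead of copying data to and fro. */
void process_zero_copy(cl_context ctx, cl_command_queue queue,
                       cl_kernel kernel, size_t nbytes)
{
    cl_int err;

    /* Driver-allocated buffer: gives the driver the chance to place it so
     * that both CPU and GPU can use it directly. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE,
                                nbytes, NULL, &err);

    /* Map for the CPU to fill in the input data: no copy on a unified-memory
     * system, just cache maintenance where needed. */
    void *ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                   0, nbytes, 0, NULL, NULL, &err);
    /* ... fill ptr with input data ... */
    clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);

    /* Run the kernel on the same allocation (one work item per byte,
     * purely illustrative). */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    size_t gws = nbytes;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);

    /* Map again to read the results in place, rather than clEnqueueReadBuffer. */
    ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                             0, nbytes, 0, NULL, NULL, &err);
    /* ... consume results ... */
    clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
    clReleaseMemObject(buf);
}
```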

 

Another problem can occur if we want to use OpenCL (which is a “general purpose” heterogeneous compute API) together with any graphics processing we are doing with OpenGL® ES, even if the latter is only used for rendering output frames. The default operation is to copy data between the two API contexts. This leads to further redundant copies, which can have a very bad impact on performance.

Before interop extensions support was introduced, we had to:

  • Read the required data into the GPU OpenCL memory space,
  • Perform all the required processing on the GPU,
  • Map the image back to CPU side,
  • Upload the data/image again to GPU memory on OpenGL ES side,
  • Render using OpenGL ES texture.

 

Interop extensions create a common memory space between the CPU and GPU (EGL™/OpenCL), and between graphics and compute APIs (OpenGL ES/OpenCL). The diagram below illustrates the 50% improvement in performance as well as the fact that we are now using almost the total of our application time doing useful work (as opposed to spending over a third of our time moving data around, which in addition to consuming time has an even more pronounced impact on energy consumption). A minimal code sketch of the interop path follows the diagram.

 

Image5.png
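As a rough sketch of the interop path, assuming the driver exposes the cl_khr_egl_image and EGL_KHR_gl_texture_2D_image extensions: the OpenGL ES texture is wrapped in an EGLImage and handed directly to OpenCL, so the map-back and re-upload steps listed above disappear. Names and error handling are illustrative only.

```c
#include <stdint.h>
#include <CL/cl.h>
#include <CL/cl_egl.h>
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>

/* Sketch: share an OpenGL ES texture with OpenCL via an EGLImage, avoiding
 * the map-back/re-upload round trip described in the list above. */
cl_mem wrap_gl_texture_for_cl(cl_platform_id platform, cl_context cl_ctx,
                              EGLDisplay dpy, EGLContext egl_ctx, GLuint tex)
{
    cl_int err;

    /* 1. Create an EGLImage referring to the GL texture's storage. */
    EGLImageKHR image = eglCreateImageKHR(dpy, egl_ctx, EGL_GL_TEXTURE_2D_KHR,
                                          (EGLClientBuffer)(uintptr_t)tex, NULL);

    /* 2. Look up the extension entry point and wrap the EGLImage as a cl_mem. */
    clCreateFromEGLImageKHR_fn createFromEGLImage =
        (clCreateFromEGLImageKHR_fn)clGetExtensionFunctionAddressForPlatform(
            platform, "clCreateFromEGLImageKHR");

    cl_mem cl_image = createFromEGLImage(cl_ctx, dpy, image,
                                         CL_MEM_READ_WRITE, NULL, &err);

    /* OpenCL kernels can now read/write 'cl_image' directly; the driver adds
     * the necessary synchronization at acquire/release time instead of copies. */
    return cl_image;
}
```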

 

Further experimentations: auto-white balance

 

Each light source has its individual “colour” or “temperature”, which varies between red (sunset, candle light) and blue (bright Californian sky, but not this time of the year). The human eye automatically adjusts to this, so that it does not matter in what lighting condition you see the colour white, it will still look white. For cameras you either need to manually calibrate (use presets such as “daylight”, “indoors”, etc) or post-process the raw data at a later stage, or…  you use the Auto White Balance (AWB) setting and let the camera take care of it.

ARM experimented with implementing AWB in software using OpenCL on an ARM® Mali™-T604 GPU. Our AWB algorithm uses the Gray World Assumption. This algorithm was chosen because of its popularity. The Gray World approach uses the average colour of the entire image to calculate a corrective colour shift.

Our simplified GPU implementation is divided into three stages, as depicted by the diagram below:

  • First, a kernel calculates the R, G and B channel sums using atomic operations (accelerated in hardware). This stage operates on a smaller, down-sampled 8-bit image.
  • An average for each channel is then calculated by dividing the outputs from the previous kernel by the image size. This is then converted into YUV and passed to the next stage.
  • Finally, we work on the full-resolution 16-bit image. We convert every pixel from RGB to YUV and correct the colour based on stage 2’s outputs. We then convert the pixel from YUV back to RGB and store it to the output buffer (a minimal kernel sketch of this stage follows the diagram below).

 

Image6.png
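For a flavour of the code involved, here is a minimal OpenCL C sketch of the final correction stage in the one-pixel-per-work-item style of our first version. It is illustrative only: for brevity it applies the per-channel gains directly in RGB rather than going through the YUV round trip described above, and the buffer layout and gain parameter are assumptions.

```c
/* Stage 3 (sketch): correct each full-resolution pixel using per-channel
 * gains derived from the grey-world averages computed in stages 1 and 2.
 * One pixel per work item; the optimized version processed four at a time. */
__kernel void awb_correct(__global const ushort4 *src,   /* 16-bit RGBA input */
                          __global ushort4       *dst,   /* corrected output  */
                          float4                  gain)  /* per-channel gains */
{
    size_t i = get_global_id(0);

    /* Convert to float and apply the white-balance gains... */
    float4 rgb = convert_float4(src[i]) * gain;

    /* ...then saturate back into the 16-bit range before storing. */
    dst[i] = convert_ushort4_sat(rgb);
}
```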

 

Our first version of the algorithm computed one pixel per work item and used INT16 data types. We then optimized this to process four pixels per work item and also rearranged the calculations to make better use of the ARM Mali-T600 GPU’s ALUs. This gave a 7.4x improvement over the first GPU version, and a roughly 1.5 order of magnitude performance uplift compared to our CPU-only reference implementation. And we managed all of this in less than 90 lines of OpenCL code.

 

HDR, segmentation and more

 

Another ARM partner, Apical (who were also presenting at the event), have optimized their advanced tone mapping algorithms with OpenCL on Mali and were able to achieve close to a 20x uplift in performance compared to their CPU/NEON™-only implementation.

 

Harri Hirvola of the Nokia Research Center implemented a superpixel clustering algorithm based on SLIC (Simple Linear Iterative Clustering) using OpenCL.

 

Image1.png

 

Clustering is needed for tasks such as object extraction or removal. The implementation used five kernels and showed a significant benefit from GPU acceleration/offload.

 

Image8.png

 

Engaging with Partners

 

A fundamental part of ARM enabling new technologies is fostering a strong ecosystem. We work very closely with our customers, OEMs, third party software vendors and developers. In this context we supply developers and other interested parties with platforms and tools to explore and develop using the latest processing technologies such as GPU Compute on ARM Mali GPUs. To find out more about this, email: gpucompute-info@arm.com or check out http://malideveloper.arm.com.  To reach out to our many partners active in this field, register for the ARM Connected Community today and join ARM Mali Graphics.

It was a privilege for me to take part in the Electronic Imaging Conference and I would like to thank again Michael Frank and his team for asking ARM to take part.

ARM Mali GPU demos at MWC 2014


Mobile World Congress (MWC) is now only a week away and we are just finalizing the demos that are going to be shown.  I thought I would take this opportunity to let you know some of the new technologies that will be available on the ARM Booth (Hall 6, Stand 6C10).

 

Tablet and mobile silicon can easily be integrated into other products like STB or Smart TVs to smoothly render 4K content.  We will have a whole range of devices on show, from wearables through to 4K TVs,  all able to run the same content and show how scalable ARM® Mali™ GPUs really are.

 

Video codecs are a hot topic right now as we see HEVC and Google’s VP9 emerging.  Our partner, Ittiam Systems, have solutions for both codecs which make use of GPU Compute to improve performance while minimizing energy consumption.  These demos will be shown on ARM Mali-T604 and ARM Mali-T628 GPUs.  Many other Ecosystem partners will be showing codec demos at MWC, highlighting how important this use case is for mobile GPU Compute.

 

Still on the theme of GPU Compute, we will be showing an audience analysis demonstration that has been enabled through GPU Compute on the ARM Mali-T604 GPU.  When applied to an automotive environment, a car could recognise not only the driver (probably through which key has been used) but also, thanks to this solution, the passengers on board as well and automatically adapt the travelling experience to their needs.  Or how about signage that detects when no one is present and switches off to save power, but when someone is present it shows targeted advertising based on age and gender. If you add precise, GPU Compute-enabled gesture recognition, these advertisements instantly become interactive.  All of this is now possible with OpenCL™-capable GPUs like those in the ARM Mali-T6xx and T7xx series. 

 

ARM Mali-450 GPUs have started shipping in large volumes around the globe.  At the ARM Booth we have a selection of the latest products implementing these mid-range GPUs.   Last year, the latest Tier-1 mobile devices were showing off very complex OpenGL® ES 2.0 content; similar content this year will be running much smoother than ever before – and on entry level devices rather than premium smartphones.  Feel free to try the Alcatel Idol X+, based on the Mediatek MTK6592 containing an octa-core ARM Cortex®-A7 CPU and quad-core ARM Mali GPU, to see what is now possible with today’s mid-range phones.

 

Finally, I will be wearing the Samsung Galaxy Gear to show how wearable technology is also able to run high quality gaming in such a small form factor.  It always surprises anyone who sees this watch how capable it really is!

 

See you at the show.

Phill

Another Milestone for ASTC Texture Compression



Time sure flies when you’re having fun! It’s been more than two years since SIGGRAPH Asia 2011 in Hong Kong, where I had the pleasure of unveiling our Adaptive Scalable Texture Compression (ASTC) technology. A lot has happened on the ASTC front since then: we made many technical improvements, providing even better quality and finer control of bit rate; we published full technical details of the format at High Performance Graphics 2012; the Khronos Group ratified an ASTC extension; and the first consumer devices with ASTC support began to appear on the market. It fairly boggles the mind. Well, it does mine, anyway.

 

Why am I wandering down memory lane like this? Two reasons: One is that just before this past Christmas, I went back to Hong Kong for SIGGRAPH Asia 2013, this time to talk more generally about power and memory bandwidth reduction in the latest ARM® Mali™ GPUs. The other reason, and the main one, is that ASTC recently passed another milestone in its march toward ubiquity, as the Khronos Group ratified extensions covering the full functionality of the format.


Wait a minute, I thought…

 

You thought Khronos had already standardized ASTC?  It had, but not completely. Khronos released an extension called KHR_texture_compression_astc_ldr at SIGGRAPH 2012. However, that extension exposed only the low dynamic range (LDR, get it?) pixel formats of ASTC, and only for 2D images.  We did that because at the time, details of the high dynamic range (HDR) and 3D features hadn’t been nailed down, and some Khronos members weren’t sure they would work as well as we hoped – they were, after all, pretty revolutionary. Also, some Khronos members wanted to start implementing the 2D LDR features of ASTC right away, before we were ready to freeze the definition of the more advanced features.


Three Flavors of ASTC

 

I’m happy to say that, in the end, the HDR and 3D features of ASTC turned out to work very well indeed. Recognizing this, the Khronos Group recently ratified two new extensions, adding HDR and 3D functionality respectively.  The ASTC family now looks like this:

 

KHR_texture_compression_astc_ldr is the previously-ratified low dynamic range profile

KHR_texture_compression_astc_hdr extends the LDR profile to include HDR

OES_texture_compression_astc extends the HDR profile to include 3D textures

 

The extensions are layered, with each new layer requiring the previous layers, so if your implementation supports KHR_texture_compression_astc_hdr, all of the LDR features are supported too. If it supports OES_texture_compression_astc, it supports everything. If you try to use an HDR texture on an implementation that doesn’t support HDR, the LDR portions of the texture decode normally, and the HDR texels come back a lovely shade of radioactive pink.

 

You might be wondering about the extension name prefixes: why KHR-blah-ldr and KHR-blah-hdr, but OES-blah-astc? The OES prefix identifies an extension that is defined and ratified by the OpenGL® ES working group, for use with OpenGL ES. Extensions with the KHR prefix are ratified by both the OpenGL ES and desktop OpenGL® working groups, and can be used with either API. So, you can and will see ASTC LDR- and HDR-capable GPUs on desktop as well as mobile devices, but for the moment there’s no way to ship ASTC 3D textures on the desktop.  It’s too bad, but hey, OpenGL ES is shipping in a billion devices a year; and the desktop will catch up eventually.

 

What it all means

 

So, ASTC HDR and 3D are now available as Khronos standards. What does that mean? How does it make life better for mobile device manufacturers, or app developers, or users?

 

We’ve written at length about the technology – how ASTC offers developers unprecedented flexibility in bit rate and pixel format, as well as a substantial boost in image quality. And Sean has a great article describing how the HDR and 3D features of ASTC work, and why they’re useful – even, potentially, revolutionary. If you aren’t convinced by now, you aren’t going to be, so I won’t repeat that story here.


What Khronos standardization adds to the picture is that it puts ASTC on the road to becoming universally available. By placing the format under the Khronos IP umbrella, it removes the uncertainties that have prevented widespread adoption of proprietary formats like S3TC and PVRTC. It is also, obviously, a powerful endorsement of the technology. Add in the enthusiastic reception the format has received from developers, and the bottom line is that GPU vendors now have many reasons to support it in their hardware, and few reasons not to. ASTC has been available for some time now in the Exynos-based versions of the Samsung Galaxy Note 3, Note Pro and other devices, which feature the ARM Mali-T628 MP6 GPU. We understand that it’ll be supported in upcoming SoCs and IP cores from Qualcomm, NVIDIA, and Imagination Technologies as well. Other implementations are on the way.


OK, I can’t resist…

 

I said I wasn’t going to talk about ASTC from a technical point of view, but I can’t resist – after all, you can’t write a blog about texture compression without showing an image, can you? So here’s an image. Actually, here are two:

 

RMTeapotUncompressed.png

Figure 1: A (chocolate-free) teapot rendered using a 2MB volume texture

 

RMTeapotCompressed.png

Figure 2: The same teapot with the volume texture compressed to 151KB using ASTC.

 

What you’re seeing is an implementation of a procedural marble shader, taken from the AMD RenderMonkey™ examples. What’s interesting about it is that it’s not a 2D marble texture uv-mapped onto the surface of the teapot. Instead, the shader samples a 3D noise function at every point on the surface, and uses the result to sample a 1D color gradient texture. The 1D texture is tiny, but the noise function is implemented as a 128x128x128 volume texture. The original 8-bit, single channel texture (used to produce the upper image) occupies 2 MB – not huge, but big enough to make you ask if you really need it, at least on a mobile device. The version in the second image uses the same volume texture, compressed using ASTC at 0.59 bits per pixel, which reduces it to 151 KB. Can you see the difference? I didn’t think so.

 

This is just a toy example, but I hope it shows how ASTC’s low-bit-rate 3D compression can change the game, making previously stressful or even unthinkable algorithms practical.  I can’t wait to see how serious game developers will make use of the technology, when it reaches them.

 

As always – got comments or questions? Got ideas for clever ways to use HDR or 3D textures? Drop me a line…

 

Tom Olson is Director of Graphics Research at ARM. After a couple of years as a musician (which he doesn't talk about), and a couple more designing digital logic for satellites, he earned a PhD and became a computer vision researcher. Around 2001 he saw the coming tidal wave of demand for graphics on mobile devices, and switched his research area to graphics.  He spends his working days thinking about what ARM GPUs will be used for in 2016 and beyond. In his spare time, he chairs the Khronos OpenGL ES Working Group.

A new feather in the HSA Foundation’s cap


You may recall a couple of blogs I’ve written which mentioned a group called the HSA Foundation™, something that I have invested a lot of interest in over the past two years. The HSA Foundation is a not-for-profit consortium of SoC IP vendors, OEMs, academia, SoC vendors, OSVs and ISVs whose worthwhile goal is to make it easy to program for parallel computing.

 

The first blog announced the formation of the HSA Foundation in 2012 with ARM as a founding member; the second announced the first release from the Foundation, its Programmer’s Reference Manual; the third blog – this blog – proudly announces that, within only two years of creation, the HSA Foundation has been named “Best Processor Technology” by The Linley Group.

 

The best technology category is slightly different from The Linley Group’s other Analysts’ Choice Awards in that it doesn’t have to relate to either a specific product or a specific quantitative measure. Instead, it represents the technology that they consider will make the greatest impact on the microprocessor industry. And this year they decided that this technology was that of the HSA Foundation.

 

Being granted this award within so short a time is testimony to the hard work and dedication with which the HSA Foundation is driving towards its goal. It is a much appreciated recognition of the achievements we have made so far in the field of heterogeneous computing and the potential that the HSA Foundation has to make a hugely positive impact in the future. In the words of The Linley Group, “We believe the HSA represents the best opportunity to offer high compute capability at the lowest power while still maintaining ease of programming. Working together, these vendors can build more-efficient SoC processors by enabling the CPU and GPU elements (and other programmable units like DSPs) to work together on parallel workloads such as image processing, computational photography, and speech recognition... we believe the HSA will have the most influence on future microprocessors.”

 

At the end of May 2013, the HSA Foundation released Version 0.95 of its Programmer’s Reference Manual, and later versions of that and the System Architecture Reference Manual are coming as well.

 

My thanks and congratulations go out to ARM’s fellow founders of the HSA Foundation: AMD, Imagination Technologies, MediaTek, Qualcomm, Samsung, and Texas Instruments, all of whom have helped to make this achievement possible.


The Mali GPU: An Abstract Machine, Part 2


In my previous blog I started defining an abstract machine which can be used to describe the application-visible behaviors of the Mali GPU and driver software. The purpose of this machine is to give developers a mental model of the interesting behaviors beneath the OpenGL ES API, which can in turn be used to explain issues which impact their application’s performance. I will use this model in the future blogs of this series to explore some common performance pot-holes which developers encounter when developing graphics applications.

 

This blog continues the development of this abstract machine, looking at the tile-based rendering model of the Mali GPU family. I’ll assume you've read the first blog on pipelining; if you haven’t I would suggest reading that first.

 

The “Traditional” Approach

 

In a traditional mains-powered desktop GPU architecture — commonly called an immediate mode architecture — the fragment shaders are executed on each primitive, in each draw call, in sequence. Each primitive is rendered to completion before starting the next one, with an algorithm which approximates to:

 

    foreach( primitive )
        foreach( fragment )
            render fragment


As any triangle in the stream may cover any part of the screen the working set of data maintained by these renderers is large; typically at least a full-screen size color buffer, depth buffer, and possibly a stencil buffer too. A typical working set for a modern device will be 32 bits-per-pixel (bpp) color, and 32bpp packed depth/stencil. A 1080p display therefore has a working set of 16MB, and a 4k2k TV has a working set of 64MB.  Due to their size these working buffers must be stored off-chip in a DRAM.

 

model-imr.png

 

Every blending, depth testing, and stencil testing operation requires the current value of the data for the current fragment’s pixel coordinate to be fetched from this working set. All fragments shaded will typically touch this working set, so at high resolutions the bandwidth load placed on this memory can be exceptionally high, with multiple read-modify-write operations per fragment, although caching can mitigate this slightly. This need for high bandwidth access in turn drives the need for a wide memory interface with lots of pins, as well as specialized high-frequency memory, both of which result in external memory accesses which are particularly energy intensive.

 

The Mali Approach

 

The Mali GPU family takes a very different approach, commonly called tile-based rendering, designed to minimize the amount of power hungry external memory accesses which are needed during rendering. As described in the first blog in this series, Mali uses a distinct two-pass rendering algorithm for each render target. It first executes all of the geometry processing, and then executes all of the fragment processing. During the geometry processing stage, Mali GPUs break up the screen into small 16x16 pixel tiles and construct a list of which rendering primitives are present in each tile. When the GPU fragment shading step runs, each shader core processes one 16x16 pixel tile at a time, rendering it to completion before starting the next one. For tile-based architectures the algorithm equates to:

 

    foreach( tile )
        foreach( primitive in tile )
            foreach( fragment in primitive in tile )
                render fragment

 

As a 16x16 tile is only a small fraction of the total screen area it is possible to keep the entire working set (color, depth, and stencil) for a whole tile in a fast RAM which is tightly coupled with the GPU shader core.

model-tbr.png

 

This tile-based approach has a number of advantages. They are mostly transparent to the developer but worth knowing about, in particular when trying to understand bandwidth costs of your content:

 

  • All accesses to the working set are local accesses, which is both fast and low power. The power consumed reading or writing to an external DRAM will vary with system design, but it can easily be around 120mW for each 1GByte/s of bandwidth provided. Internal memory accesses are approximately an order of magnitude less energy intensive than this, so you can see that this really does matter.
  • Blending is both fast and power-efficient, as the destination color data required for many blend equations is readily available.
  • A tile is sufficiently small that we can actually store enough samples locally in the tile memory to allow 4x, 8x and 16x multisample antialiasing1. This provides high quality and very low overhead anti-aliasing. Due to the size of the working set involved (4, 8 or 16 times that of a normal single-sampled render target; a massive 1GB of working set data is needed for 16x MSAA for a 4k2k display panel) few immediate mode renderers even offer MSAA as a feature to developers, because the external memory footprint and bandwidth normally make it prohibitively expensive.
  • Mali only has to write the color data for a single tile back to memory at the end of the tile, at which point we know its final state. We can compare the block’s color with the current data in main memory via a CRC check — a process called Transaction Elimination— skipping the write completely if the tile contents are the same, saving SoC power. My colleague Tom Olson has written a great blog on this technology, complete with a real world example of Transaction Elimination (some game called Angry Birds; you might have heard of it). I’ll let Tom’s blog explain this technology in more detail, but here is a sneak peek of the technology in action (only the “extra pink” tiles were written by the GPU - all of the others were successfully discarded).

     blogentry-107443-087661400 1345199231_thumb.png


  • We can compress the color data for the tiles which survive Transaction Elimination using a fast, lossless, compression scheme — ARM Frame Buffer Compression (AFBC) — allowing us to lower the bandwidth and power consumed even further. This compression can be applied to offscreen FBO render targets, which can be read back as textures in subsequent rendering passes by the GPU, as well as the main window surface, provided there is an AFBC compatible display controller such as Mali-DP500 in the system.
  • Most content has a depth and stencil buffer, but doesn’t need to keep their contents once the frame rendering has finished. If developers tell the Mali drivers that depth and stencil buffers do not need to be preserved2 — ideally via a call to glDiscardFramebufferEXT (OpenGL ES 2.0) or glInvalidateFramebuffer (OpenGL ES 3.0), although it can be inferred by the drivers in some cases — then the depth and stencil content of the tile is never written back to main memory at all (see the snippet just after this list). Another big bandwidth and power saving!
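A minimal sketch of that hint, for an application rendering to the window surface with OpenGL ES 3.0 (the attachment enums differ slightly for application-created FBOs):

```c
#include <GLES3/gl3.h>

/* Sketch: at the end of the frame, tell the driver that the depth and stencil
 * contents will not be needed again, so the tiles never write them back. */
void discard_depth_stencil(void)
{
    /* For the default (window) framebuffer use GL_DEPTH / GL_STENCIL;
     * for an application FBO use GL_DEPTH_ATTACHMENT / GL_STENCIL_ATTACHMENT. */
    const GLenum attachments[] = { GL_DEPTH, GL_STENCIL };
    glInvalidateFramebuffer(GL_FRAMEBUFFER, 2, attachments);

    /* On OpenGL ES 2.0 with EXT_discard_framebuffer, the equivalent call is
     * glDiscardFramebufferEXT(GL_FRAMEBUFFER, 2, attachments) using
     * GL_DEPTH_EXT / GL_STENCIL_EXT. */
}
```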

 

It is clear from the list above that tile-based rendering carries a number of advantages, in particular giving very significant reductions in the bandwidth and power associated with framebuffer data, as well as being able to provide low-cost anti-aliasing. What is the downside?

 

The principal additional overhead of any tile-based rendering scheme is the point of hand-over from the vertex shader to the fragment shader. The output of the geometry processing stage, the per-vertex varyings and tiler intermediate state, must be written out to main memory and then re-read by the fragment processing stage. There is therefore a balance to be struck between costing extra bandwidth for the varying data and tiler state, and saving bandwidth for the framebuffer data.

 

In modern consumer electronics today there is a significant shift towards higher resolution displays; 1080p is now normal for smartphones, tablets such as the Mali-T604 powered Google Nexus 10 are running at WQXGA (2560x1600), and 4k2k is becoming the new “must have” in the television market. Screen resolution, and hence framebuffer bandwidth, is growing fast. In this area Mali really shines, and does so in a manner which is mostly transparent to the application developer - you get all of these goodies for free with no application changes!

 

On the geometry side of things, Mali copes well with complexity. Many high-end benchmarks are approaching a million triangles a frame, which is an order of magnitude (or two) more complex than popular gaming applications on the Android app stores. However, as the intermediate geometry data does hit main memory there are some useful tips and tricks which can be applied to fine tune the GPU performance, and get the best out of the system. These are worth an entire blog by themselves, so we’ll cover these at a later point in this series.

 

Summary

 

In this blog I have compared and contrasted the desktop-style immediate mode renderer, and the tile-based approach used by Mali, looking in particular at the memory bandwidth implications of both.

 

Tune in next time and I’ll finish off the definition of the abstract machine, looking at a simple block model of the Mali shader core itself. Once we have that out of the way we can get on with the useful part of the series: putting this model to work and earning a living optimizing your applications running on Mali.

 

As always comments and questions more than welcome,

Pete

 

Footnotes

 

  1. Exactly which multisampling options are available depends on the GPU. The recently announced Mali-T760 GPU includes support for up to 16x MSAA.
  2. The depth and stencil discard is automatic for EGL window surfaces, but for offscreen render targets they may be preserved and reused in a future rendering operation.

 

 


Pete Harris is the lead performance engineer for the Mali OpenGL ES driver team at ARM. He enjoys spending his time working on a whiteboard and determining how to get the best out of combined hardware and software compute sub-systems. He spends his working days thinking about how to make the ARM Mali drivers even better.

ArrayFire at MWC14: Maths Libraries Optimized for ARM Mali GPU Compute


At the Supercomputing '13 Conference recently held in Denver ARM first showcased ArrayFire support for the ARM® Mali GPU and the technology is on show to a wider audience this week at Mobile World Congress in Barcelona.  This exciting development is catching the attention of many attendees as they discover the ArrayFire demos at the ARM booth in Hall 6 C10.

 

arrayfire.jpg

Visitors to the AccelerEyes exhibit check out the ArrayFire OpenCL demo at MWC 2014

Energy budgets are always constrained, and form an expensive component of any HPC system. ARM Mali GPUs provide the best performance and throughput for a given energy envelope. Partnering with ARM, AccelerEyes further reduces the cost of HPC by minimizing development time and costs.

 

 

AccelerEyes offers the most productive software solutions for accelerating code using GPUs, coprocessors, and OpenCL devices.  AccelerEyes delivers ArrayFire to accelerate C, C++, and Fortran codes on CUDA and OpenCL devices.  ArrayFire customers come from a wide range of industries, including defense and intelligence, life science, oil and gas, finance, manufacturing, media, and others. ArrayFire has had success accelerating numerous application types, including math and numerical algorithms, image processing, signal processing, statistics, optimization, and more.

 

 

ARM Mali GPU users can deploy ArrayFire as a fast, easy-to-use software solution to enable acceleration of their code. ArrayFire now fully supports OpenCL™. With ArrayFire v2.0, AccelerEyes has released a demo version of ARM support to select customers. In ArrayFire v2.1, AccelerEyes will make ARM support an official feature of the ArrayFire library for use on all ARM-enabled platforms.

 

You can visit ARM at the Mobile World Congress this week (February 24-27) to learn more about how ARM Mali users can benefit from ArrayFire. Stay tuned for future updates as ARM and AccelerEyes continue collaboration on this exciting project!

Mucho GPU Compute, amigo!


I have just returned from my third Mobile World Congress (MWC) in Barcelona and I have no doubt this year’s has been for me the best one so far! We are continuing to see a strong trend in independent third party companies using ARM® Mali™ GPU Compute to improve their products and this year we were not just showing demos, we were showing mature solutions on shipping hardware, delivering measured benefits to end users. In case you weren’t at MWC this week, below are some examples of the great GPU Compute solutions that were on show at the event.

 

Mature implementations of VP9 and HEVC: products, not just demos

 

As announced last week, Ittiam Systems demonstrated their HEVC and VP9 implementations accelerated using ARM Mali-T600 GPU Compute technology. Ittiam has been supporting OpenCL™ on Mali GPUs in their HEVC codec for over a year now, and their Mali-optimized VP9 implementation (which is built on the same core technology) has been showcased since last year. The tight partnership between ARM and Ittiam has enabled us to optimize our device drivers for this type of workload, which enables improved performance and reduced energy and bandwidth consumption for all users. At this year’s event we were able to demonstrate the maturity of both HEVC and VP9 by using GPU Compute. 1080p30 content was shown to decode comfortably on previous-generation SoCs, such as the ones that have been shipping in devices as early as 2012. Just imagine the benefits you will get when using the latest Mali GPUs!

 

Watch Mukund Srinivasan, General Manager for Consumer and Mobility Business at Ittiam, talking about how leveraging on ARM Cortex® CPU with ARM NEON™ technology and ARM Mali GPU Compute has enabled them to improve the performance and energy efficiency of their VP9 and HEVC implementations:

 

 

 

At the ARM Booth we showcased a demo kindly supplied to us by the YouTube team, where we streamed real, live VP9 encoded content on a Samsung Galaxy NotePRO 12.2 tablet.

 

RobertoM_2.jpg

 

Thanks to the collaboration with partners like Ittiam Systems, VP9 can be enabled with the additional benefits of Mali GPU Compute. Quoting Matt Frost, Senior Business Product Manager, WebM Project: "With VP9, ARM Mali-based devices will allow people to watch YouTube in high definition at half the bandwidth currently used."

 

As mentioned in the past, many codec vendors in the industry are now working on HEVC implementations and are optimizing them for Cortex CPUs and Mali GPUs. Examples include Squid Systems, PixTree, VisualOn, ArcSoft and more. At MWC 2014, Aricent, a global innovation and services company, was also demonstrating their implementation which has been accelerated using ARM Mali GPU Compute.

 

RobertoM_4.jpg

 

 

Camera applications and real time video processing and analysis

 

Pre and post-processing of still and moving images is also a great use case for GPU Compute. At last year’s event we extensively demonstrated the benefits of GPU Compute acceleration with our partners Synthesis Corporation and MulticoreWare. As the ecosystem of partners around GPU Compute continues to grow, more solutions are becoming available.

 

Alva Systems provides a super-resolution solution that makes use of interpolation and texture enhancement technology combined with a smart colour compensation technique. This was optimized for ARM Mali GPUs and supports both the OpenGL® ES and the OpenCL APIs. Alva also implemented an advanced image stabilization solution that, in addition to minimizing the shake and motion effect when taking video (pre-encode), can also be used to correct a pre-encoded video stream via real time post-processing during video playback. Both super-resolution and stabilization solutions support at least 1080p30 thanks to GPU Compute support. Both these solutions have been demonstrated at MWC14 on an ARM Mali-T628 MP6 GPU-based device.

 

Watch Angela Wei, VP of Sales and Marketing at Alva Systems, discussing this in the following video:

 

 

 

ThunderSoft is another ARM Mali partner; they are a software and services company headquartered in China. One of their products is UCam, a very popular turn-key camera solution shipping in over 50m units (downloads and pre-loads). It features over 60 real time image processing effects operating at full frame, optimized for NEON and GPU processing. ThunderSoft understands the importance of GPU Compute acceleration and has been collaborating with ARM on the optimization of some of the UCam image processing filters using RenderScript on Android. At MWC we were able to demonstrate real time “manga effect” filters applied to a live feed on a Google Nexus 10 tablet. The standard algorithm implementation (with no GPU offload) only achieves a handful of frames per second, and even this fully loads both CPUs. Converting the filters to RenderScript enabled the bulk of the processing to be offloaded to the GPU. We recorded a CPU load reduction of over 40%, whilst performance improved many fold to enable real time use of the application.

 

Seth Bernsen, President of ThunderSoft America, talks about the partnership with ARM and the merits of using GPU Compute:

 

 

We see a lot of excitement in the ecosystem around the many partners developing software based computer vision applications. In addition to gesture user interface, at this year’s event we also hosted an advanced face detection and analysis solution implemented by PUX (Panasonic). This technology has been optimized for GPU Compute since last year and was shown publically for the first time at the Embedded Technology 2013 conference in Yokohama. We were excited to showcase an improved implementation at the MWC 2014. The demo supports up to 20 faces being detected at the same time and detects gender, age, facial expression and eye gaze. Imagine this used for profiling people viewing a shop window.

 

RobertoM_5.jpg

 

Beyond Mobile

 

Our key theme for this event was that ARM technology deployment goes beyond mobile and is present everywhere, from sensors to servers. So too is GPU Compute on ARM Mali GPUs.

 

A year ago we previewed a prototype by Aptina; this year we were able to show our customers an entire ISP pipeline from raw, high resolution, interlaced HDR sensor data all the way to rendering to screen - all of this running on the GPU. As discussed in my presentation at the Electronic Imaging event a few weeks ago, a significant amount of driver optimization now enables this kind of application on existing SoCs. The computational load required for this type of ISP work, in the absence of a hardware ISP, cannot be handled by the CPU on its own. GPU Compute enables this use case.

 

We also demonstrated gesture UI improved using GPU Compute. This is a fantastic example of applied machine vision. Our partner, eyeSight Technologies, is a leader in this field and has collaborated with ARM for a long time to improve the robustness and performance of their gesture detection solution using OpenCL on ARM Mali GPUs. At this event we extended the scope of this application beyond driving a DTV UI control to also showcase how GPU Compute improves in-car UI. The challenge for many gesture detection solutions is their use in poor lighting conditions. eyeSight demonstrated that with the additional compute power of the GPU you can significantly improve the robustness and accuracy of gesture detection.

RobertoM_1.jpg

 

Another one of our partners, AccelerEyes, produces software libraries and tools for GPU Compute. At last year’s SC13 event they were able to showcase a port of their ArrayFire HPC maths library accelerated using OpenCL on an ARM Mali-T600 GPU, and this was again available to see at MWC this week. Check out Scott Blakeslee’s blog here.

 

The flexibility and scalability of our architecture enables us to target a variety of use cases, from sensor to server. We also support Full Profile and 64-bit natively, in hardware. After years of evangelising the benefits of such an approach it is nice to see other players in the industry follow down this avenue. Similar solutions to what we have been pioneering for some time are starting to appear across the market; many of these were shown or announced at MWC 2014. It is rewarding to see many industry players being inspired by ARM Mali GPU Compute and the innovative efforts of the ARM ecosystem.

 

I have been overwhelmed by the request for meetings and the positive feedback from our partners and customers and I look forward to future events where we can continue to showcase the great work that our partners and ecosystem are doing through leading the industry around GPU Compute applications on ARM Mali GPUs.

“Better living through (appropriate) geometry”


You know those deep questions you get in philosophy like "If a tree falls in a forest and no one is there to hear it, does it still make a sound?", "What is the sound of one hand clapping?", "Why is there something rather than nothing?", etc. Well the equivalent in graphics terms is probably "How many triangle per second is enough?".

 

Let’s start with a proverb…

“Only when you know why you have hit the target, can you truly say that you have learned archery.” - Guan Yinzi.

 

I’ve put together this blog to assist in the understanding of appropriate use of geometry to drive fragment creation with the hopeful outcome that more people will understand the question – “What is the purpose of a triangle in 3D graphics?”

 

As a generalized statement, 3D graphics objects are made up of convex geometric approximations of the outer hull or “skin” of a real world object that it represents using points in 3D (X,Y,Z) space.  These points are referred to as vertices with a relationship between those vertices such that one or more points will combine to form a primitive, which in OpenGL® ES is constrained to points, triangles or lines. In the case of the triangle this represents a facet of the surface of the object.

 

Why approximation? Well, if you throw enough triangles at it you can pretty much say they are exact (note: my internal monologue is now having a Sheldon Cooper-esque argument about the correctness of that statement), but approximation is usually good enough for your brain to make out the general shape of the object and recognize it as a small off duty Czechoslovakian traffic warden or a banana, as it were. The job of a single triangle in this geometry is to serve as a sample container for the section of the surface it represents, with the detail of that surface being represented by the fragments that will be generated in the final image. The point of this is that the triangle can cheaply represent a container, which can in turn be relatively cheaply manipulated (scaled, rotated, etc.), and the relationship between effects and other sample references (lighting, textures, normal maps, etc.) applied to its vertices can then be cheaply extrapolated to create the fragments which are contained within its surface. The fragments are the things of value, as they are the things that will be seen; the triangle primitive is merely the vessel by which they arrive. For more on this see this blog.

“Going beyond is as bad as falling short.” - Chinese proverb

 

Given the above, the ratio of primitives to fragments is expected to be a 1:‘N’ relationship where ‘N’ is many. There is therefore a watermark for ‘N’: if, when rendered, the geometry consistently yields fewer fragments per triangle than that watermark, the primitive breaks down as a sample container and so does the efficiency of the GPU, because you are now limited by something that was supposed to have trivial cost. In other words, if the cost of the per-vertex calculations (where cost is a function of compute, bandwidth, etc.) outstrips the total cost of the fragments a triangle contains, then you are into negative ROI: you are spending more compute on non-visible containers than on visible pixels.

The watermark for ‘N’ depends greatly on the relative fragment and vertex processing costs and can be very complex to calculate. To make life easier, the watermark is usually characterized by the cost of the rasterisation stage, the mathematical process which breaks the primitive into the fragments that its footprint represents. In modern GPUs, rasterisation costs in the order of 8-10 cycles for a basic triangle; therefore, as a rule of thumb, coverage of 10 fragments to one triangle should be used as the low water mark.
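
If you can estimate how many triangles actually survive culling in a typical frame, checking your content against that low water mark is simple arithmetic. A rough sketch, again in Python, using the rule-of-thumb numbers from this blog (the triangle count here is invented purely for illustration):

    def average_fragments_per_triangle(width, height, overdraw, drawn_triangles):
        return (width * height * overdraw) / drawn_triangles

    # 300,000 drawn (i.e. not culled) triangles on a 1080p screen with 1.3x average overdraw:
    n = average_fragments_per_triangle(1920, 1080, 1.3, 300000)
    print(n)   # ~9 fragments per triangle: below the ~10 fragment watermark,
               # so this content is likely spending more on containers than it should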

 

Obey the law! - Plowman's Law of Visual Saturation

 

Using the above discussion we have established the purpose of the triangle and some sensible constraints for what we could define as being a meaningful triangle. Given this we can begin to discuss how many triangles are appropriate before you reach a point of visual saturation. Plowman's Law is my attempt at defining a point at which adding more geometry to a scene will yield little to no extra visual return for the additional processing cost of that geometry, i.e. you have reached a point of visual saturation.

 

Plowman's Law defines visual saturation as the point at which the number of pixels on a surface of a given resolution, multiplied by the average overdraw factor to allow for overdraw and then divided by the average fragment coverage per triangle, yields the maximum number of triangles worth drawing (i.e. not culled before rasterisation). To allow for additional triangles which may potentially be back facing, which in an average convex object would be about 50%, we double the calculation. Laying that out as a formula we end up with something like this:
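
Written out with some illustrative names for the terms described above (my own shorthand rather than anything official):

    TrianglesAtSaturation = 2 × ((ScreenWidth × ScreenHeight × AverageOverdrawFactor) / AverageFragmentsPerTriangle)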

 

Sensible overdraw factors should lie within the range of 1.3 to 2.0, with 2.0 being a very high overdraw factor for a well-written graphics application. Remember that this is not a single point of overdraw, but the amount of overdraw per pixel averaged across the surface being rendered. The 1.3 factor is much more sensible as it represents about 30% of the screen being redrawn. Now we have our formula, let's look at a simple example for a 1080p screen:
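
Taking the rule-of-thumb values above as assumptions rather than measurements: 1920 × 1080 is roughly 2.07 million pixels; multiplying by an overdraw factor of 1.3 gives roughly 2.7 million fragments; dividing by 10 fragments per triangle gives roughly 270,000 front-facing triangles; and doubling to allow for back-facing triangles gives roughly 540,000 meaningful triangles per frame.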

 

Given this number, if we were to multiply by a target frame rate we would derive the number of triangles per second required to achieve visual saturation in every frame, or the answer to the question "How many triangles per second are enough?". In the case of a 1080p screen @ 60Hz that works out at roughly 32M triangles/sec.

 

Next time we'll talk about the issues of getting all that performance within the constraints of a mobile device in “PHENOMENAL COSMIC POWERS! Itty-bitty living space!”

Mobile GPUs at GDC 2014


We are just performing a dry-run of ARM’s GDC demos and I thought it would be good to let everyone know what can be seen at the Moscone Centre between March 19th and 21st where ARM will be demonstrating how our technology is expanding the mobile gaming experience. 

 

As usual we will have the extremely popular ARM Wall covered in large screens showing the latest technology available on the market today. The largest screen in the middle of the wall will be showing how our mid-range GPUs (an ARM® Mali™-450 GPU in an Amlogic STB) are able to render in-game UIs and complex game content smoothly at 4K resolution at vsync; the same silicon is used just as effectively in mobile devices. As resolutions grow, developers will be able to improve the detail and visual experience of their mobile games if they develop with 4K in mind. The Samsung Galaxy Gear smart watch, with its ARM Mali-400 GPU, will be running the same content as the 4K display, demonstrating just how scalable ARM Mali GPUs are.

 

To the right of the wall our engineers will be on hand to explain to developers the details and benefits of the ASTC texture compression standard which is now available in shipping devices.  Here you will be able to see the performance, visual and power benefits that have been achieved with this new Khronos-adopted format.

 

Below the large screens we will have a line up of devices highlighting the diverse range of gaming equipment which ships with ARM processors, all of which are able to run high end content.  ARM technology is in over 95% of mobile devices and as smartphone and tablet shipments start to overtake that of PCs and consoles, ARM’s IP will be at the heart of the next generation of gaming.

  • The new Alcatel Idol X+, with an eight-core CPU and quad-core ARM Mali-450 GPU, will be running the latest content from Gameloft.
  • The recently announced Huawei P6S will be demonstrating how low-end handsets are able to run the complex content that was previously only possible on last year’s high-end phones.
  • The sell-out Hudl tablet will be demonstrating what is now possible on entry-level tablets in the mass market. It is equipped with a Rockchip RK3188 SoC built around a quad-core Cortex-A9 CPU and a quad-core ARM Mali-400 GPU.
  • We will also have two ARM Mali powered Android™ games consoles on show: the GameStick and the GamePop.
  • A Google Nexus 10 and Samsung Galaxy Note 10.1 will be showing how the ARM-specific feature ‘Transaction Elimination’ is used to reduce bandwidth usage on the SoC, thus saving power. Note that as resolutions grow we will see greater power savings.
  • There will also be a demonstration of “Compute Shaders”, a new feature that will be available in the next version of the OpenGL® ES API.

 

Visit the ARM Tools demo pod to see the latest version of the ARM DS-5™ Streamline™ toolset that gives developers the ability to visualise how both the CPU and GPU cores are loaded.  The Mali Graphics Debugger will also be on show, an invaluable tool that can be used to help optimise game content and find any potential areas that may need extra attention before release.

 

Finally, many of our gaming ecosystem partners will be on the ARM Booth showing their latest solutions designed to help developers succeed in mobile gaming:

  • See the latest real-time global illumination products from Geomerics
  • Simplygon will be showing developers how to condition their assets for different platforms
  • Umbra have tools that improve occlusion culling to yield significant performance uplift
  • Ask Goo and PlayCanvas about their new WebGL/HTML5 development tools
  • Samsung developers will be showing off their Chord multiplayer SDK
  • Havok Anarchy was announced at GDC 2013. See some of the content that has emerged over the last year and how their tools enable stunning 3D Games
  • Cocos 2D  will be demonstrating their new game engine which has been optimised for ARM platforms
  • PerfectWorld will be showing games created with their Echo-es game engine
  • Tencent will be able to show you some of the new features available from their Atomic game engine
  • Take a close look at the Genesis 3D demo available from Sohu

 

Our multiplayer gaming zone will be running again this year with a wide selection of competitive games from our partners which have been optimized for mobile using the resources on show at the Tools demo pod.  Discover the quality of graphics that is now achievable on mobile devices, then take part in the Sports Car Challenge to win a brand new ARM Mali-T628 GPU enabled Samsung Galaxy Note 10.1 2014 Edition.

 

Finally, at the Smart Phone Summit (Monday and Tuesday) we will have a selection of demos on display ahead of the main show. Here you will be able to discuss the tools in great detail with the engineers who created them, see a vast selection of ARM powered products and see how the gap between mobile and console game graphics continues to narrow.

 

See you at the show.

Phill

 

PS. Did anyone spot that I missed what will be on the left of the Mali wall? You will have to wait and see! Take a guess in the comments box below...
