Last week I had the great opportunity to deliver a talk at the Electronic Imaging Conference in San Francisco. The conference track was introduced by Michael Frank of LG Electronics, who gave a very interesting historical backdrop to parallel computing technologies and the importance of heterogeneous computing. Philip Rogers and Harris Gasparakis of AMD followed with a fine overview of the rationale and merits of the Heterogeneous System Architecture and its implications for image processing and computer vision.
Soon after that it was my turn. I could have talked for days, but as I only had a 45-minute slot I opted to focus my talk on mobile computing and the taking and processing of photos.
(Pre)history of the camera phone and the rise of smartphone photography
Many attribute the birth of the camera phone to Philippe Kahn, a technology pioneer who broadcast the birth of his daughter in June 1997 through a small camera integrated into a Motorola phone. The first commercial camera phones started appearing in 2000, with resolutions as low as 0.11 megapixels and limited storage (only tens of photos). Initially, not all devices supported browsing and wireless sharing; in some cases the user had to connect the phone to a PC by cable to download the pictures.
By 2002, resolutions had grown to 0.3 megapixels, the camera phone had reached U.S. shores, and features such as white balance control, self-timer, digital zoom, and various filter effects were making their way into handsets.
Then the megapixel race started. In 2004, the one megapixel barrier was smashed, making photo quality sufficient for small prints - and some handsets even allowed you to take up to 8 photos in a burst shot!
Resolutions quickly ran up to two, three and five megapixels and features such as LED flash, professional brand optics, image stabilization, autofocus and in some cases even futuristic rotating screens started to appear.
You can find more details in this great article.
Today, smartphone photography is gradually becoming the most popular form of photo taking, with many people using their phone as their primary camera. By 2011 over a quarter of photographs were taken on a mobile phone (NPD Group, 2012) and 18% of consumers considered their phone to be their primary camera (Consumer Electronics Association, 2012). A more recent study has seen this trend consolidate, with 58% of North American smartphone photo-taking consumers taking photos exclusively with their phone in 2013. This, in turn, stimulates photo sharing. For example, when Instagram launched, over 100k people signed up in the first week, and by 2013 it had over 150M registered active users.
Developers are starting to take advantage of these trends. A quick search for camera apps in the Google Play™ store returned several hundred applications:
The Spring of Computational Photography
Computational photography refers to computational image capture, processing, and manipulation techniques that enhance or extend the capabilities of digital photography.
The computational photography features of a modern smartphone are impressive:
- Multi image processing: picture-in-picture dual shot, “beautification”, “best face” photo composition, HDR, 3D panorama.
- Real time processing: stabilization, de-noising, face and features detection, motion capture.
- Post production: advanced image filters, occlusion removal, 2D to 3D conversion.
However, for all the great technology you can put into the sensor, mobile devices are still limited by size, form factor and cost considerations. The sensor and lens on a phone are never going to match a full-size SLR. On top of that, you have to compensate for external factors such as challenging lighting conditions and the skill (or lack thereof) of the photographer.
Some truly innovative software solutions have recently emerged to resolve these challenges, and I focused on these for the rest of my talk.
Image sensors and the role of the GPU
Over 1 billion higher quality camera phone modules (at least 3 megapixel resolution) will ship in 2014 and over half of these modules will have at least 8 megapixel sensors. The camera phone module market was $10.8 billion in 2012, and is expected to have grown to $13.3 billion in 2013, or 24% per year (IC Insights, 2013; Research in China, Global and China CMOS Camera Module Industry Report, 2012-2013).
One of our many partners in this market, Aptina Imaging, has taken this technology even further. They prototyped a platform consisting of their latest interlaced HDR sensor set up to feed raw data to the Samsung Exynos 5250 SoC on an Insignal Arndale board. They then offloaded the whole image post-processing pipeline to the ARM Mali-T604 GPU using OpenCL™, including noise reduction, iHDR reconstruction, tone mapping, colour conversion, gamma and de-noising.
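To give a flavour of what moving an ISP stage into software looks like, here is a minimal sketch of a gamma-correction kernel in OpenCL C. It is purely illustrative and not Aptina's code: it assumes a single 16-bit plane and a simple power-law curve, whereas a production pipeline would typically use calibrated lookup tables.

```
/* Illustrative sketch only (not Aptina's pipeline code): one ISP stage,
 * gamma correction, expressed as an OpenCL C kernel over a 16-bit plane. */
__kernel void gamma_correct(__global const ushort *src,
                            __global ushort *dst,
                            const float inv_gamma,   /* e.g. 1.0f / 2.2f */
                            const int width)
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    const int idx = y * width + x;

    float v = (float)src[idx] / 65535.0f;        /* normalise to [0,1] */
    v = pow(v, inv_gamma);                       /* apply the gamma curve */
    dst[idx] = (ushort)(v * 65535.0f + 0.5f);    /* back to 16 bits */
}
```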
Why is this approach interesting? A software solution offers more flexibility and enables algorithm modifications right up to the release of the consumer device. Sensor and camera module vendors can invest in optimized, portable software libraries instead of hardware. One camera module and sensor configuration can be used to address a broader range of market requirements. At the same time, SoC implementers can reduce device costs by offloading selected ISP blocks to the CPU + GPU subsystem, and optimize and balance the interoperation between traditional ISP processing and the GPU.
Improving interoperation
A critical aspect of heterogeneous computing is the communication of data between the various processors. OpenCL implements a host-device architecture, where your application prepares data buffers and hands them over to the device (the worker), which copies the data back when it has finished. OpenCL has its roots in traditional GPGPU computing, which assumes discrete video RAM. Even when copies are not physically required (i.e. the CPU and GPU share the same physical RAM, as in an ARM SoC), the programming abstraction and API still envisage copies, which may well be redundant. We have optimized our drivers to remove these overheads by avoiding such redundant copies.
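As an illustration of the kind of pattern that benefits, here is a sketch of how an application can avoid an explicit host-to-device copy on a unified-memory SoC by letting the driver allocate the buffer and mapping it on the host. It is a simplified example, not ARM driver code; error handling is omitted and the helper shown is hypothetical.

```
#include <CL/cl.h>

/* Sketch of a zero-copy pattern: allocate with CL_MEM_ALLOC_HOST_PTR and map
 * the buffer on the host, so the application writes pixels in place instead
 * of issuing a separate clEnqueueWriteBuffer. Error handling omitted. */
cl_mem create_zero_copy_buffer(cl_context ctx, cl_command_queue queue,
                               size_t num_bytes, const unsigned char *pixels)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                num_bytes, NULL, &err);

    /* Map to obtain a host pointer backed by the same physical memory. */
    unsigned char *ptr = (unsigned char *)clEnqueueMapBuffer(
        queue, buf, CL_TRUE, CL_MAP_WRITE, 0, num_bytes, 0, NULL, NULL, &err);

    for (size_t i = 0; i < num_bytes; ++i)   /* fill in place */
        ptr[i] = pixels[i];

    clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
    return buf;   /* kernels can now use buf with no extra staging copy */
}
```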
Another problem can occur if we want to use OpenCL (a “general purpose” heterogeneous compute API) together with any graphics processing we are doing with OpenGL® ES, even if the latter is only used to render the output frames. The default behaviour is to copy data between the two API contexts. This leads to further redundant copies, which can have a significant impact on performance.
Before interop extension support was introduced, we had to:
- Read the required data into the OpenCL memory space on the GPU,
- Perform all the required processing on the GPU,
- Map the image back to the CPU side,
- Upload the data/image again to GPU memory on the OpenGL ES side,
- Render using an OpenGL ES texture.
Interop extensions create a common memory space between the CPU and GPU (EGL™/OpenCL), and between the graphics and compute APIs (OpenGL ES/OpenCL). The diagram below illustrates the 50% improvement in performance, as well as the fact that we now spend almost all of our application time doing useful work - as opposed to spending over a third of it moving data around, which, in addition to consuming time, has an even more pronounced impact on energy consumption.
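The following sketch shows one common way of sharing a texture between the graphics and compute APIs, using the cl_khr_gl_sharing extension. It is illustrative only: the exact interop mechanism available on a given platform may differ (for example, EGL-image based sharing), and the context is assumed to have been created with the appropriate GL-sharing properties.

```
#include <CL/cl.h>
#include <CL/cl_gl.h>
#include <GLES2/gl2.h>

/* Sketch: wrap an existing OpenGL ES texture as an OpenCL image so the
 * processed frame never leaves GPU memory. Assumes ctx was created with
 * GL-sharing properties and that the kernel writes to an image2d_t at
 * argument 0. Error handling omitted for brevity. */
void process_shared_texture(cl_context ctx, cl_command_queue queue,
                            cl_kernel kernel, GLuint gl_texture)
{
    cl_int err;
    cl_mem cl_image = clCreateFromGLTexture2D(ctx, CL_MEM_WRITE_ONLY,
                                              GL_TEXTURE_2D, 0, gl_texture,
                                              &err);

    /* Hand ownership to OpenCL, run the kernel, then hand it back to GL. */
    clEnqueueAcquireGLObjects(queue, 1, &cl_image, 0, NULL, NULL);

    size_t global[2] = { 1920, 1080 };   /* assumed frame size */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &cl_image);
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);

    clEnqueueReleaseGLObjects(queue, 1, &cl_image, 0, NULL, NULL);
    clFinish(queue);                     /* texture can now be drawn by GL */

    clReleaseMemObject(cl_image);
}
```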
Further experiments: auto white balance
Each light source has its own “colour” or “temperature”, which varies between red (sunset, candle light) and blue (a bright Californian sky, though not at this time of year). The human eye automatically adjusts to this, so that no matter what lighting conditions you see the colour white in, it still looks white. For cameras, you either calibrate manually (using presets such as “daylight”, “indoors”, etc.), post-process the raw data at a later stage, or… use the Auto White Balance (AWB) setting and let the camera take care of it.
ARM experimented with implementing AWB in software using OpenCL on an ARM® Mali™-T604 GPU. Our AWB algorithm uses the Gray World Assumption, chosen because of its popularity. Gray World uses the average colour of the entire image to determine the colour shift to correct by.
Our simplified GPU implementation is divided into three stages, as depicted by the diagram below:
- First, a kernel calculates the R, G and B channel sums using atomic operations (accelerated in hardware). This stage operates on a smaller, down-sampled 8-bit image.
- An average for each channel is then calculated by dividing the outputs from the previous kernel by the image size. This is then converted into YUV and passed to the next stage.
- Finally, we work on the full-resolution 16-bit image: we convert every pixel from RGB to YUV, correct the colour based on stage 2’s output, convert the pixel back to RGB and store it in the output buffer (see the kernel sketch after this list).
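To make the last stage more concrete, here is a minimal sketch of what such a correction kernel can look like in OpenCL C. It is not the actual ARM kernel: it assumes an interleaved 16-bit RGB buffer, BT.601 conversion coefficients, and that the stage 2 chroma averages are passed in as the avg_u and avg_v arguments.

```
/* Sketch of stage 3 (not the exact ARM kernel): gray-world correction on the
 * full-resolution 16-bit image. Under the Gray World Assumption the average
 * chroma of the scene should be zero, so we subtract the measured averages
 * from every pixel's chroma. One pixel per work item. */
__kernel void awb_correct(__global const ushort *src,   /* interleaved RGB */
                          __global ushort *dst,
                          const float avg_u,
                          const float avg_v,
                          const int width)
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    const int i = (y * width + x) * 3;

    float r = (float)src[i], g = (float)src[i + 1], b = (float)src[i + 2];

    /* RGB -> YUV (BT.601), shifting the chroma towards gray */
    float Y =  0.299f * r + 0.587f * g + 0.114f * b;
    float U = -0.169f * r - 0.331f * g + 0.500f * b - avg_u;
    float V =  0.500f * r - 0.419f * g - 0.081f * b - avg_v;

    /* YUV -> RGB, saturated to the valid 16-bit range */
    dst[i]     = convert_ushort_sat(Y + 1.402f * V);
    dst[i + 1] = convert_ushort_sat(Y - 0.344f * U - 0.714f * V);
    dst[i + 2] = convert_ushort_sat(Y + 1.772f * U);
}
```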
Our first version of the algorithm computed one pixel per work item and used INT16 data types. We then optimized this to compute four pixels per work item and rearranged the calculations to make better use of the ARM Mali-T600 GPU’s ALUs, resulting in a 7.4x improvement and, overall, a performance uplift of around 1.5 orders of magnitude compared to our CPU-only reference implementation. And we managed all of this in less than 90 lines of OpenCL code.
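The sketch below illustrates the general “four pixels per work item” idea rather than the actual optimized ARM kernel: it assumes planar R, G and B buffers so that four neighbouring pixels of each channel can be loaded as one vector, which maps well onto vector ALUs.

```
/* Illustration of processing four pixels per work item (not ARM's kernel),
 * assuming planar 16-bit R, G and B buffers and BT.601 coefficients. */
__kernel void awb_correct_x4(__global const ushort *r_plane,
                             __global const ushort *g_plane,
                             __global const ushort *b_plane,
                             __global ushort *r_out,
                             __global ushort *g_out,
                             __global ushort *b_out,
                             const float avg_u,
                             const float avg_v)
{
    const int i = get_global_id(0) * 4;               /* 4 pixels per item */

    float4 r = convert_float4(vload4(0, r_plane + i));
    float4 g = convert_float4(vload4(0, g_plane + i));
    float4 b = convert_float4(vload4(0, b_plane + i));

    float4 Y =  0.299f * r + 0.587f * g + 0.114f * b;
    float4 U = -0.169f * r - 0.331f * g + 0.500f * b - avg_u;
    float4 V =  0.500f * r - 0.419f * g - 0.081f * b - avg_v;

    vstore4(convert_ushort4_sat(Y + 1.402f * V),              0, r_out + i);
    vstore4(convert_ushort4_sat(Y - 0.344f * U - 0.714f * V), 0, g_out + i);
    vstore4(convert_ushort4_sat(Y + 1.772f * U),              0, b_out + i);
}
```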
HDR, segmentation and more
Another ARM partner, Apical (who were also presenting at the event), have optimized their advanced tone mapping algorithms with OpenCL on Mali and were able to achieve close to a 20x uplift in performance compared to their CPU/NEON™-only implementation.
Harri Hirvola of the Nokia Research Center implemented a superpixel clustering algorithm based on SLIC (Simple Linear Iterative Clustering) using OpenCL.
Clustering is needed for tasks such as object extraction or removal. The implementation used five kernels and achieved a significant speed-up when the work was offloaded to the GPU.
Engaging with Partners
A fundamental part of ARM enabling new technologies is fostering a strong ecosystem. We work very closely with our customers, OEMs, third party software vendors and developers. In this context we supply developers and other interested parties with platforms and tools to explore and develop using the latest processing technologies such as GPU Compute on ARM Mali GPUs. To find out more about this, email: gpucompute-info@arm.com or check out http://malideveloper.arm.com. To reach out to our many partners active in this field, register for the ARM Connected Community today and join ARM Mali Graphics.
It was a privilege for me to take part in the Electronic Imaging Conference and I would like to thank again Michael Frank and his team for asking ARM to take part.