SagivTech's OpenCL Mobile Computing Benchmark Suite

The Mobile Computing Benchmark Suite

Mobile device manufacturers and consumers alike have shown growing interest in greater computing power on mobile devices. This processing power is needed for gaming, augmented reality, image and video enhancement, and other compute-intensive applications that run on mobile devices.

One way to increase mobile device processing power is to better exploit the massive computing power offered by on-board Graphics Processing Units (GPUs), which can carry out compute-intensive tasks with relatively low power consumption.

Objective measurement of key performance indicators has always been crucial to understanding how a given platform performs. The best way to reach peak values for these indicators may vary from vendor to vendor because of different hardware designs and software stack limitations. This complicates evaluating different types of devices and deciding which platform is faster and by how much, which makes benchmarking all the more necessary.

For all these reasons, SagivTech created the Mobile Computing Benchmark Suite to establish reliable performance metrics for common mobile computing operations and the expected performance of GPU code on specific hardware. This post takes a closer look at two of the primary benchmarks in the suite, bandwidth and FLOPs, and shows the sustained results for both, measured on the ARM Mali T628MP6 GPU found in the Samsung Galaxy Note 10.1 device.

Bandwidth Benchmarks

The bandwidth test measures how fast input data (once in memory visible to the GPU) can reach the GPU’s ALUs for processing. This is usually measured in gigabytes per second (GB/s). An algorithm is considered bandwidth-limited if most of its run time is spent loading and storing data between memory and the ALUs. The faster the data can reach the ALUs, the faster the algorithm runs and the sooner the user sees the output.

Currently, SagivTech’s Bandwidth Benchmark tests use only floating-point variables, with four sub-tests using float, float2, float4 and float8 variables. The test kernel is very simple: it just copies one variable from the input buffer to the output buffer.
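
As a rough illustration of how a copy test turns into a GB/s figure, here is a minimal host-side sketch in C. The function name copy_bandwidth_gbps is hypothetical and not part of the suite, and the sketch assumes that both the read and the write of each element count as traffic; the suite itself may count bytes differently.

```c
#include <stddef.h>

/* Effective bandwidth of a copy test, in GB/s (1 GB = 1e9 bytes).
 * Assumption: every element is read once and written once, so the
 * traffic is 2 * n_elems * elem_bytes. The element size is 4 bytes for
 * float, 8 for float2, 16 for float4 and 32 for float8. */
double copy_bandwidth_gbps(size_t n_elems, size_t elem_bytes, double seconds)
{
    if (seconds <= 0.0) {
        return 0.0; /* no meaningful rate without a positive time */
    }
    double bytes_moved = 2.0 * (double)n_elems * (double)elem_bytes;
    return bytes_moved / seconds / 1e9;
}
```

Under this counting, copying 1,000,000 float4 elements (16 bytes each) in 4 ms works out to 8 GB/s. These numbers are only an arithmetic example, not a measurement.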

Figure 1 shows the measured GB/s on ARM’s Mali T628MP6 GPU.

FLOPs Benchmark

While a bandwidth-bound algorithm is limited by the time spent in the memory subsystem, a compute-bound algorithm is limited by the time the GPU spends doing math. Compute performance is usually measured in FLOPs: how many floating-point operations can be done in one second. For example, A = b + c counts as one FLOP, and A = b + c + d counts as two FLOPs. On most common GPU implementations, peak performance is calculated by counting an FMA operation, A = b + c * d, as two FLOPs, even though most platforms can calculate it in one clock.

For compute-bound algorithms running on the GPU, the higher the peak theoretical and sustained FLOPs the device can achieve, the faster the algorithm will run. As with bandwidth, achieving a high FLOPs rate in real-world applications can be very challenging, and the way to achieve it can vary significantly from platform to platform.

SagivTech’s FLOPs Benchmark measures the peak sustained FLOPs a compute-enabled GPU can achieve. To do this, SagivTech created a set of test scenarios to see how each platform behaves under different types of load and usage, and how much of the GPU’s peak performance each scenario actually achieves.

The table below details all the scenarios tested in the FLOPs Benchmark, along with the parameters used to calculate how many FLOPs each test yields on the platform at hand. Of course, these tests are not real-world workloads; they were defined to demonstrate how peak-performance FLOPs are calculated and how well each scenario maps to the current platform.

#	Operation	Inst #	Calculation	Var type	FLOPs
1	Simple add	1	val1 = value1 + val1	float	1 * 1 * 1 = 1
2	Simple add	4	val1 = value1 + val1; val2 = value1 + val2; val3 = value1 + val3; val4 = value1 + val4	float	4 * 1 * 1 = 4
3	MADD	1	val1 = value1 + val1 * value2	float	1 * 2 * 1 = 2
4	MADD	4	val1 = value1 + val1 * value2; val2 = value1 + val2 * value2; val3 = value1 + val3 * value2; val4 = value1 + val4 * value2	float	4 * 2 * 1 = 8
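
The FLOPs column can be reproduced with a small C helper. This is only a sketch: the name flops_per_pass is hypothetical, and reading the third factor as the vector width (1 for float, 4 for float4) is an assumption drawn from the "Var type" column rather than something the table states.

```c
/* FLOPs for one pass over a scenario's statements, as a product of three
 * factors: number of statements ("Inst #"), FLOPs per statement (1 for an
 * add, 2 for a multiply-add), and vector width.
 * Assumption: the table's third factor is the vector width. */
int flops_per_pass(int statements, int flops_per_statement, int vector_width)
{
    return statements * flops_per_statement * vector_width;
}
```

With these arguments, scenario #4 on float is flops_per_pass(4, 2, 1), which gives 8, and the same scenario on float4 would give 32 under the vector-width assumption.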

Additional coding techniques were used to keep the test results consistent, such as manually unrolling the loop inside the kernel and averaging kernel timings by running the kernel in a loop on the host.
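
The host-side averaging can be sketched in plain C as follows. Everything here is illustrative: run_fn, average_seconds and count_call are hypothetical names, the callback stands in for "launch the kernel and wait for it to finish", and the warm-up runs are an added assumption rather than something the suite is known to do. A real OpenCL host might prefer event profiling over wall-clock timing.

```c
#include <time.h>

/* Hypothetical callback: stands in for "launch the kernel and wait". */
typedef void (*run_fn)(void *ctx);

/* Wall-clock time in seconds, using C11 timespec_get. */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Average time of one run, measured over `runs` back-to-back calls.
 * Assumption: `warmup` untimed calls are made first so that one-time
 * setup costs do not skew the average. */
double average_seconds(run_fn run, void *ctx, int warmup, int runs)
{
    if (runs <= 0) {
        return 0.0;
    }
    for (int i = 0; i < warmup; ++i) {
        run(ctx);
    }
    double start = now_seconds();
    for (int i = 0; i < runs; ++i) {
        run(ctx);
    }
    return (now_seconds() - start) / runs;
}

/* Stand-in workload for trying the helper: counts how often it is called. */
void count_call(void *ctx)
{
    ++*(int *)ctx;
}
```

Calling average_seconds(count_call, &calls, 2, 10) with an int counter invokes the callback 12 times in total and returns the mean time of the 10 timed calls.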

Below is sample kernel code for scenario #4 in the table above, using float4 as the variable type:

FLOPs Test Kernel

__kernel void kernelFlopsTest(__global float4 *data, int iterations)
{
     int tid = get_global_id(0);
     float4 val1, val2, val3, val4, value1, value2;

     /* Seed all operands from this work-item's input element. */
     val1 = data[tid];
     value1.x = val1.w; value1.y = val1.z; value1.z = val1.y; value1.w = val1.x;
     value2.x = val1.w; value2.y = val1.w; value2.z = val1.y; value2.w = val1.y;
     val2.x = val1.w; val2.y = val1.z; val2.z = val1.y; val2.w = val1.x;
     val3 = value1; val4 = value2;

     for (int j = 0 ; j < iterations; ++j)
     {
         /* Each two-line group holds four float4 multiply-add statements. */
         val1 = value1 + val1 * value2; val2 = value1 + val2 * value2;
         val3 = value1 + val3 * value2; val4 = value1 + val4 * value2;
         /* additional 98 times of the above two lines */
         val1 = value1 + val1 * value2; val2 = value1 + val2 * value2;
         val3 = value1 + val3 * value2; val4 = value1 + val4 * value2;
     }

     /* Combine the four accumulators and write the result back. */
     data[tid] = (val1 + val2 + val3 + val4);
}
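
To turn a timed run of a kernel like this into a GFLOP/s figure, the host can count the multiply-add statements executed and divide by the elapsed time. The C sketch below is an assumption-laden illustration: kernel_gflops is a hypothetical name, the number of unrolled statements per loop iteration is passed in rather than fixed (the listing elides most of them), and the few additions in the final store are ignored.

```c
#include <stddef.h>

/* Sustained GFLOP/s for a multiply-add kernel.
 * Each multiply-add statement counts as 2 FLOPs per vector lane.
 * Assumption: the caller supplies the real number of unrolled
 * multiply-add statements per loop iteration. */
double kernel_gflops(size_t work_items, int iterations,
                     int madd_statements_per_iteration,
                     int vector_width, double seconds)
{
    if (seconds <= 0.0) {
        return 0.0; /* no meaningful rate without a positive time */
    }
    double flops = (double)work_items * (double)iterations
                 * (double)madd_statements_per_iteration
                 * 2.0 * (double)vector_width;
    return flops / seconds / 1e9;
}
```

As a purely arithmetic example, 1024 work-items running 100 iterations of 400 float4 multiply-add statements in 10 ms would come to roughly 32.8 GFLOP/s. These inputs are made up for illustration and are not results from the suite.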

Figure 2 shows the measured GFLOP/s on ARM’s Mali T628MP6 GPU.

Wrapping Up

SagivTech created the Mobile Computing Benchmark Suite to make it easier to evaluate how well new hardware might perform when running heavy computing tasks. The first steps in this evaluation are to measure the GPU’s maximum sustained memory subsystem performance and its floating-point arithmetic capabilities. Comparing these sustained performance numbers to the theoretical peaks that the device is supposed to yield can help in identifying device strengths, capabilities and limitations.

To learn more about SagivTech’s Mobile Computing Benchmark Suite and the full results of our initial benchmarking, please download the Mobile Computing Benchmark Suite White Paper.

About SagivTech

SagivTech is a leading and veteran provider of innovative technology, solutions and services for GPU computing and computer vision, and a recognized source of expertise in image and signal processing algorithms and software development for parallel computing platforms. For more information, please visit www.sagivtech.com.

Disclaimer

Reference to any specific commercial product, process or service by its trade name, trademark, and manufacturer or otherwise, does not constitute or imply its endorsement or recommendation by us. Any trademark reference belongs to its owner and we make no claim as to its use or ownership and will use it only to truthfully and accurately identify such product or service.

WE WILL NOT BE LIABLE TO YOU OR ANYONE ELSE FOR ANY LOSS OR DAMAGES OF ANY KIND (INCLUDING, WITHOUT LIMITATION, FOR ANY SPECIAL, DIRECT, INDIRECT, INCIDENTAL, EXEMPLARY, ECONOMIC, PUNITIVE, OR CONSEQUENTIAL DAMAGES) IN CONNECTION WITH THE DATA OR YOUR USE THEREOF OR RELIANCE THEREUPON, EVEN IF FORESEEABLE OR EVEN IF WE HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES (INCLUDING, WITHOUT LIMITATION, WHETHER CAUSED IN WHOLE OR IN PART BY NEGLIGENCE, GROSS NEGLIGENCE, OR OTHERWISE, BUT EXCLUDING WILLFUL MISCONDUCT) UNLESS SPECIFIED IN WRITING. OUR TOTAL AND AGGREGATE LIABILITY IN CONNECTION WITH THE DATA OR YOUR USE THEREOF OR RELIANCE THEREUPON WILL NOT EXCEED USD $100.

YOUR USE AND RELIANCE UPON THE DATA IS AT YOUR RISK. IF YOU ARE DISSATISFIED WITH THE DATA OR ANY OF THE INFORMATION, YOUR SOLE AND EXCLUSIVE REMEDY IS TO DISCONTINUE USE OF OR RELIANCE ON THE DATA.

YOU ACKNOWLEDGE AND AGREE THAT IF YOU INCUR ANY DAMAGES THAT ARISE OUT OF YOUR USE OF THE DATA OR RELIANCE THEREUPON, THE DAMAGES, IF ANY, ARE NOT IRREPARABLE AND ARE NOT SUFFICIENT TO ENTITLE YOU TO AN INJUNCTION OR OTHER EQUITABLE RELIEF RESTRICTING EXPLOITATION OF ANY DATA, PRODUCT, PROGRAM, OR OTHER CONTENT OWNED OR CONTROLLED BY US.
