Vertex interleaving
Recently we were asked via the community whether there was an advantage in either interleaving or not interleaving vertex attributes in buffers. For the uninitiated, vertex interleaving is a way of mixing all the vertex attributes into a single buffer. So if you had 3 attributes (let's call them Position (vec4), Normal (vec4), and TextureCoord (vec2)) uploaded separately, they would look like this:
P1xP1yP1zP1w , P2xP2yP2zP2w , P3xP3yP3zP3w ... and so on
N1xN1yN1zN1w , N2xN2yN2zN2w , N3xN3yN3zN3w ... and so on
T1xT1y , T2xT2y , T3xT3y ... and so on
(In this case the commas denote a single vertex's worth of data)
The interleaved buffer would look like this:
P1xP1yP1zP1w N1xN1yN1zN1w T1xT1y , P2xP2yP2zP2w N2xN2yN2zN2w T2xT2y ,
P3xP3yP3zP3w N3xN3yN3zN3w T3xT3y ... and so on
… Such that the individual attributes are mixed, with a given block containing all the information for a single vertex. This technique is what the stride argument in the glVertexAttribPointer function and its variants is for: it lets the application tell the hardware how many bytes it has to jump forward to get to the same element in the next vertex.
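To make that concrete, here's a minimal sketch of how the interleaved buffer above might be bound. The attribute locations (0, 1, 2) and the vbo handle are assumptions made for this illustration; vbo is taken to be a buffer already created and filled with the interleaved data.

#include <GLES2/gl2.h>

/* Sketch: binding one interleaved buffer with per-attribute offsets.
 * Each vertex block is vec4 + vec4 + vec2 = 10 floats. */
GLsizei stride = 10 * sizeof(GLfloat);

glBindBuffer(GL_ARRAY_BUFFER, vbo);

/* Position: 4 floats starting at byte 0 of each vertex block */
glVertexAttribPointer(0, 4, GL_FLOAT, GL_FALSE, stride, (const void *)0);
glEnableVertexAttribArray(0);

/* Normal: 4 floats starting 16 bytes into each block */
glVertexAttribPointer(1, 4, GL_FLOAT, GL_FALSE, stride, (const void *)(4 * sizeof(GLfloat)));
glEnableVertexAttribArray(1);

/* Texture coordinate: 2 floats starting 32 bytes into each block */
glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, stride, (const void *)(8 * sizeof(GLfloat)));
glEnableVertexAttribArray(2);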
However, even though we all knew about interleaving, none of us could really say whether it was any better or worse than just putting each attribute in a different buffer, because (to put it bluntly) separate buffers are just easier to implement.
So in a twist to usual proceedings I have conferred with arguably the top expert in efficiency on Mali, Peter Harris. What follows is my interpretation of the arcane runes he laid out before my quivering neurons:
Interleaving is better for cache efficiency…
… Sometimes.
Why does interleaving work at all?
The general idea behind interleaving is related to cache efficiency. Whenever data is pulled from main memory it is loaded as part of a cache line. This single segment of memory will almost certainly contain more than just the information desired, as one cache line is larger than any single data type in a shader program. Once in the local cache the data in the loaded line is more quickly available for subsequent memory reads. If this cache line only contains one piece of required information, then the next data you need is in a different cache line which will have to be brought in from main memory. If however, the next piece of data needed is in the same cache line, the code can fetch directly from the cache and so performs fewer loads from main memory and therefore executes faster.
Without getting into physical memory sizes and individual components, this can be illustrated thusly:
Imagine we have 3 attributes, each of them vec4s. Individually they look like this:
| P1 P2 | P3 P4 | ...
| N1 N2 | N3 N4 | ...
| T1 T2 | T3 T4 | ...
From this point forward those vertical lines represent the boundaries between cache lines. For the sake of argument, the cache lines in this example are 8 elements long, so each contains 2 vec4s; in the real world our cache lines are 64 bytes, large enough to hold four 32-bit precision vec4 attributes. To keep the illustration clear I'll stick with the small lines in these examples. So if we want all the data for vertex number 2, we would load 3 cache lines from the non-interleaved data:
P1 P2
N1 N2
T1 T2
If this data is interleaved like so:
| P1 N1 | T1 P2 | N2 T2 | P3 N3 | T3 P4 | N4 T4 | ...
The cache lines fetched from main memory will contain:
T1 P2
N2 T2
(We start from T1 because of the cache line alignment)
Using interleaving we've performed one less cache line fetch. In terms of wasted bandwidth, the non-interleaved case loaded 3 attributes which went unused, but only one unused attribute was fetched in the interleaved case. Additionally, it's quite possible that the T1 P2 cache line wouldn't need to be specifically fetched while processing vertex 2 at all; if the previously processed vertex was vertex 1, it is likely that the data will still be in the cache when we process vertex 2.
Beware misalignment
Cache efficiency can be reduced if the variables cross cache line boundaries. Notice that in this very simple example I said the texture coordinates were vec4s. Ordinarily texture coordinates would be held as vec2s, as shown in the very first explanation of interleaving. In that case, visualising the individual elements of the buffer, the cache boundaries would divide the data in a very nasty way:
PxPyPzPw NxNyNzNw | TxTy PxPyPzPw NxNy | NzNw TxTy PxPyPzPw | …
Notice that our second vertex's normal is split, with the x,y and z,w in different cache lines. Though the two cache lines will still contain all the required data, this kind of split should be avoided, as there is a small additional power overhead in reconstructing the attribute from two cache lines. If possible, avoid splitting a single vector over two cache lines (spanning a 64-byte cache boundary); this can usually be achieved by suitable arrangement of attributes in the packed buffer. In some cases adding padding data may help alignment, but padding itself creates some inefficiency, as it introduces redundant data into the cache which isn't actually useful. If in doubt, try it and measure the impact.
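As one sketch of what "suitable arrangement" might mean in practice (the layout here is an assumption for illustration, not a universal recommendation), a C struct makes the stride and padding explicit. With the 40-byte vertex above, every few vertices a vec4 would straddle a 64-byte line; padding the stride to 48 bytes happens to keep every 16-byte field inside a single line.

/* Sketch: padding a 40-byte vertex to a 48-byte stride so that no
 * vec4 ever straddles a 64-byte cache line (48- and 64-byte periods
 * repeat every 192 bytes, and no 16-byte field crosses a boundary
 * anywhere in that repeat). The cost is 8 redundant bytes per
 * vertex, so measure before committing to it. */
typedef struct {
    float position[4];   /* bytes  0..15 of each vertex  */
    float normal[4];     /* bytes 16..31                 */
    float texCoord[2];   /* bytes 32..39                 */
    float pad[2];        /* bytes 40..47: padding only   */
} PackedVertex;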
But it's not always this simple
If we look at the function of the GPU naïvely, all of the above makes sense; however, the GPU is a little cleverer than that. Not all attributes need to be loaded by the vertex processor. The average vertex shader looks something like this:
#version 300 es

uniform vec4 lightSource;
uniform mat4 modelTransform;
uniform mat4 cameraTransform;

in vec4 position;
in vec4 normal;
in vec2 textureCoord;
in vec2 lightMapCoord;

out float diffuse;
out vec2 texCo;
out vec2 liCo;

void main( void ){
    // Pass-through outputs: no computation performed
    texCo = textureCoord;
    liCo = lightMapCoord;
    // Computed outputs
    diffuse = dot((modelTransform * normal), lightSource);
    gl_Position = cameraTransform * (modelTransform * position);
}
If you look at our outputs, diffuse is calculated at this stage, as is gl_Position, but texCo and liCo are just read from the input and passed straight back out without any computation performed. For a deferred rendering architecture this is really a waste of bandwidth, as it doesn't add any value to the data being touched. In Midgard family GPUs (Mali-T600 or higher) the driver understands this (very common) use case and has a special pathway for it. Rather than loading attributes of this type into the vertex processor and writing them out again to be interpolated, the hardware never really touches them during vertex shading: they bypass the vertex shader completely and are passed directly to the fragment shader for interpolation.
Here I've used a second set of texture coordinates to make the cache align nicely for this example. If we fully interleave all of the attributes, our cache structure looks like this:
PxPyPzPw NxNyNzNw | TxTy LxLy PxPyPzPw | NxNyNzNw TxTy LxLy | ...
Here the vertex processor still needs to load two attributes, P and N, for which the cache line loads will look like either:
PxPyPzPw NxNyNzNw
… or …
TxTy LxLy PxPyPzPw | NxNyNzNw TxTy LxLy
… to obtain the required data, depending on which vertex we are loading. In this latter case the T and L components are never used, and will be loaded again separately to feed into the interpolator during fragment shading. It’s best to avoid the redundant data bandwidth of the T and L loads for the vertex shading and the redundant loads of P and N when fragment shading. To do this we can interleave the data into separate buffers, one which contains all of the attributes needed for computation in the vertex shader:
PxPyPzPw NxNyNzNw | PxPyPzPw NxNyNzNw | PxPyPzPw NxNyNzNw | ...
… and one containing all of the attributes which are just passed directly to interpolation:
TxTy LxLy TxTy LxLy | TxTy LxLy TxTy LxLy | TxTy LxLy TxTy LxLy | ...
This means that the vertex shader will only ever need to touch cache lines from the first buffer, and the fragment interpolator will only ever have to touch cache lines from the second (as well as any other interpolated outputs from the vertex shader). This gives us a much more efficient bandwidth profile for the geometry processing. In this particular case it also means perfect cache alignment for our vertex processor.
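In API terms the split might look something like the sketch below; the buffer handles and attribute locations are assumptions made for this illustration, and both buffers are taken to be already created and filled.

#include <GLES3/gl3.h>

/* Sketch: two interleaved buffers, split by how the vertex shader
 * uses the attributes. */

/* Buffer 1: attributes the vertex shader computes with */
GLsizei computeStride = 8 * sizeof(GLfloat);   /* vec4 position + vec4 normal */
glBindBuffer(GL_ARRAY_BUFFER, vboComputed);
glVertexAttribPointer(0, 4, GL_FLOAT, GL_FALSE, computeStride, (const void *)0);
glVertexAttribPointer(1, 4, GL_FLOAT, GL_FALSE, computeStride, (const void *)(4 * sizeof(GLfloat)));

/* Buffer 2: attributes passed straight through to interpolation */
GLsizei passStride = 4 * sizeof(GLfloat);      /* vec2 texCoord + vec2 lightMapCoord */
glBindBuffer(GL_ARRAY_BUFFER, vboPassThrough);
glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, passStride, (const void *)0);
glVertexAttribPointer(3, 2, GL_FLOAT, GL_FALSE, passStride, (const void *)(2 * sizeof(GLfloat)));

for (GLuint loc = 0; loc < 4; ++loc)
    glEnableVertexAttribArray(loc);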
A note on data locality
Caches function best when programs make use of the data in the same cache lines within a small time window. This maximizes the chance that data we have fetched is still in the cache and avoids a refetch from main memory. Cache lines often contain data from multiple vertices, which may come from multiple triangles. It is therefore best practice to make sure that vertices which are adjacent in memory are also nearby in the 3D model (both in terms of attribute buffers and index buffers). This is called data locality, and you normally need look no further than your draw call's indices (if you are not using indexed models you have far bigger problems than cache efficiency to solve). If the indices look like this:
(1, 2, 3) (2, 3, 4) (3, 4, 5) (4, 5, 2) (1, 3, 5) ...
You have good data locality. On the other hand, if they look like this:
(1, 45, 183) (97, 12, 56) (4, 342, 71) (18, 85, 22) ...
… then they're all over the place and you’ll be making your GPU caches work overtime. Most modelling software will have some kind of plugin to better condition vertex ordering, so talk to your technical artists to get that sorted somewhere in the asset production process.
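If you want a quick sanity check before reaching for tooling, a rough diagnostic is easy to write. The sketch below is an illustrative metric of my own invention, not a standard tool: it reports the average spread between the smallest and largest index in each triangle, and a large average is a hint that your ordering looks like the second list above.

#include <stddef.h>

/* Rough locality diagnostic: average spread between the smallest and
 * largest index within each triangle of an indexed triangle list.   */
double average_triangle_spread(const unsigned *indices, size_t indexCount)
{
    size_t triangles = indexCount / 3;
    double total = 0.0;
    for (size_t t = 0; t < triangles; ++t) {
        unsigned a = indices[3 * t + 0];
        unsigned b = indices[3 * t + 1];
        unsigned c = indices[3 * t + 2];
        unsigned lo = a < b ? (a < c ? a : c) : (b < c ? b : c);
        unsigned hi = a > b ? (a > c ? a : c) : (b > c ? b : c);
        total += (double)(hi - lo);
    }
    return triangles ? total / (double)triangles : 0.0;
}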
To maximize cache efficiency it's also worth reviewing your vertex shader variable types, both in terms of sizes and number of elements. We see a surprising amount of content which declares vector elements and then leaves many channels unused (but allocated in memory, and so using valuable cache space); or which uploads highp fp32 data and then uses it in the shader as a mediump fp16 value. Removing unused vector elements and converting to narrower data types (provided the OES_vertex_half_float extension is available) is a simple and effective way to maximize cache efficiency, reduce bandwidth, and improve geometry processing performance.
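For the fp16 conversion, something along these lines is enough to get started; treat it as a sketch rather than production code, since it truncates rather than rounds and does not handle NaNs. GL_HALF_FLOAT_OES comes from GLES2/gl2ext.h when the extension is present.

#include <stdint.h>
#include <string.h>

/* Sketch: truncating fp32 -> fp16 conversion for attribute upload.
 * Handles normal values, zeros and overflow-to-infinity only.      */
static uint16_t float_to_half(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    uint16_t sign = (uint16_t)((bits >> 16) & 0x8000u);
    int32_t  e    = (int32_t)((bits >> 23) & 0xFFu) - 127 + 15;
    uint32_t mant = bits & 0x007FFFFFu;

    if (e <= 0)  return sign;                          /* flush small values to zero */
    if (e >= 31) return (uint16_t)(sign | 0x7C00u);    /* clamp to infinity          */
    return (uint16_t)(sign | ((uint32_t)e << 10) | (mant >> 13));
}

/* The converted data is then described with GL_HALF_FLOAT_OES, e.g.:
 * glVertexAttribPointer(loc, 2, GL_HALF_FLOAT_OES, GL_FALSE, stride, offset); */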
So there you have it: interleaving vertex attributes. It would be remiss of me to tell you to expect immediate vast performance improvements from this technique. At best it will only claw back a tiny bit of efficiency, but in large, complex projects where you need to squeeze as much as possible out of the hardware, these tiny improvements can all add up.
Thanks again to Peter Harris, who provided a lot of the information for this blog and also was kind enough to go through it afterwards and take out all my mistakes.