The Supercomputer In Your Pocket: Metal & Swift

Join Simon Gladman as he talks about Metal, Apple’s iOS framework for GPU programming. We explore how to implement Metal in a Swift project, looking at different approaches to sharing data between the two, how to build massively parallel applications, and how this enables entirely new classes of mobile software.

Often said, but worth repeating: today, you carry the equivalent of a supercomputer in your pocket. With Metal, you can build entire classes of software that, quite literally, used to require one.


Overview (0:00)

This afternoon, I’m going to talk to you about the other “M” word with iOS, and that M word is “Metal”, Apple’s framework for GPU programming. We’ll be looking at creating some software that runs on iPad that not so long ago would have been in the domain of a huge Cray supercomputer.

What is Metal? (0:21)

What is Metal? As I said, it’s a new set of technologies for GPU programming. Although technologies such as Scene Kit can get you up and running with 3-D objects and textures very quickly, Metal offers you a lot more control and better performance, but there is obviously a development overhead. We’ll be looking at that development overhead this afternoon. It’s designed for GPU-accelerated 3-D graphics and data-parallel computation, and it’s the data-parallel computation that we will be looking at specifically. It’s tied to iOS and Apple’s A7 processors or later. And it’s not really just for games: if you’re doing 3-D work it could be used for architectural projects, and the data-parallel aspects are good for data visualisation or simulations.

Types of Shader (1:12)

If you’ve worked with GPU programming before, you probably know two different types of shader. A shader is a little program that runs on the GPU. Typically you’ll start off with a vertex shader, which will take all your 3-D data — all your points — and then map them to 2-D screen coordinates. After that part has been done, the data gets rasterised, or turned into pixels. Then it’s up to the fragment shader to look at each pixel, at the normal direction of the surface, and take into account the textures, lighting, and color.

We’re going to look at a third type of shader, called a kernel shader, or compute shader. It isn’t really involved with 3-D data; it operates on raw data, such as an array, or a grid, such as an image. It’s used for things such as image processing and filters, but today we’re going to be using it to process a massive array of individual particles. First of all we’ll look at a simple system where the particles are all independent of each other, but are all attracted to one gravity well. Then we’ll look at a more complex system where every particle knows about every other particle. The shader will update the particles’ positions and velocities.

Setting up Metal in a Swift Project (2:31)

I said there was some development overhead. There are a few steps you will need to take. First we need to create a reference to a device, the Metal device, and that’s our root object, the interface, if you like, to the GPU. Then we create a library, which is a repository of the Metal shader functions we are going to write. Then we create a command queue, which enables us to queue up and submit a list of commands to the device, to the GPU, to execute.

// Create a device
let device: MTLDevice = MTLCreateSystemDefaultDevice()

// Create a default library
let defaultLibrary: MTLLibrary = device.newDefaultLibrary()

// Create a command queue
let commandQueue: MTLCommandQueue = device.newCommandQueue()

We need to create a reference to the function itself, the piece of Metal shading-language code (it’s based on C++) that we’re going to write: a Metal function. The particleRendererShader string literal is actually the name of the shader function inside the Metal code. Then we need to create the pipeline state, which holds the compiled shader code. The method newComputePipelineStateWithFunction (nicely named) does that compiling. It’s quite a costly operation, which is probably why the pipeline is immutable; it is long-lived and persistent, so you only need to do it once. We then create a command buffer, which stores the encoded commands. This is more transient; it’s a single-use object. Next up is a command encoder, which encodes the kernel shader commands so they can be written to the command buffer. The method setComputePipelineState on the encoder sets the compute function to be executed.

// Define the kernel function
let kernelFunction = defaultLibrary.newFunctionWithName("particleRendererShader")

// Define the pipeline state
let pipelineState: MTLComputePipelineState =
      device.newComputePipelineStateWithFunction(kernelFunction!, error: nil)

// Define the command buffer
let commandBuffer: MTLCommandBuffer = commandQueue.commandBuffer()

// Define the command encoder
let commandEncoder: MTLComputeCommandEncoder = commandBuffer.computeCommandEncoder()
commandEncoder.setComputePipelineState(pipelineState)

Then we’ll see how we can use the setBuffer and setTexture functions to pass resources such as textures and scalar values into the command encoder.

Preparing the Kernel & Texture (4:21)

Now that we have the encoder, we can start feeding it resources. One of these resources might be a float value. particleBrightness is a float value, but it cannot be passed in directly; it must be passed through a Metal buffer. I create a new buffer with the bytes of that float, setting its length to the size of a Float, and then pass that buffer into the command encoder.

// Set some parameters
var particleBrightness: Float = 0.8

let particleBrightnessBuffer: MTLBuffer =
  device.newBufferWithBytes(&particleBrightness,
  length: sizeof(Float), options: nil)

commandEncoder.setBuffer(particleBrightnessBuffer, offset: 0, atIndex: 2)

When we draw all these particles, we need a texture to draw them to. I can’t create a texture directly; instead, I need to create a description of the texture. Here I’m creating an RGBA8Unorm texture, which means 8 bits per channel for Red, Green, Blue, and Alpha transparency, and Unorm means the values are unsigned and normalised, stored from 0 to 255. When I have a description of that texture, I can pass it into our root device to get back an implementation of the texture.

// Initialise textures
let textureDescriptor =
   MTLTextureDescriptor.texture2DDescriptorWithPixelFormat(
       MTLPixelFormat.RGBA8Unorm,
       width: Int(imageSide), height: Int(imageSide),
       mipmapped: false)

var textureA: MTLTexture =
   device.newTextureWithDescriptor(textureDescriptor)

Once I have the texture, using setTexture, which is analogous to setBuffer, I can pass it to the encoder. I’ve given it the atIndex argument as zero. If I wanted both an input and an output texture, I could repeat that line with a different index.

commandEncoder.setTexture(textureA, atIndex: 0)

Creating a Simple Particle System (6:06)

The particles themselves have a more complex value object; they have properties such as position and velocity. Because I need to pass them into Metal in an array, I need to declare them in both Metal and Swift. Because the Metal shading language is based on C++, you need to have semicolons (remember those?). The Metal shader will update both the positions and velocities of those particles based on a simple physics model.

// Create a Particle type in Swift
struct Particle
{
      var positionX: Float = 0
      var positionY: Float = 0
      var velocityX: Float = 0
      var velocityY: Float = 0
}

// Create its Metal equivalent
struct Particle
{
      float positionX;
      float positionY;
      float velocityX;
      float velocityY;
};

In Swift, we create and populate an array of particles. We loop over the entire particleCount, a quarter of a million in this example, and create Particles with random velocities and positions. We have one set of particles coming in, which is the source of information, and an updated set of particles going out. Going in, we do almost the same as we did for the float: figure out the length using sizeofValue, create the buffer inVectorBuffer with newBufferWithBytes, and use setBuffer to pass it into the shader. Coming out, we need a buffer with enough space to hold the particles; we pass that into the command encoder with setBuffer too, but using an atIndex of 1.

// Determine length of buffer
let particleVectorByteLength = particles.count * sizeofValue(particles[0])

// Create and populate the buffer
var inVectorBuffer = device.newBufferWithBytes(
  &particles, length: particleVectorByteLength, options: nil)

// Pass the buffer to the shader
commandEncoder.setBuffer(inVectorBuffer, offset: 0, atIndex: 0)
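
The population loop itself isn’t shown above. Here is a CPU-side sketch in current Swift syntax; the coordinate ranges and the use of Float.random are my assumptions for illustration:

```swift
struct Particle {
    var positionX: Float = 0
    var positionY: Float = 0
    var velocityX: Float = 0
    var velocityY: Float = 0
}

let particleCount = 250_000
var particles = [Particle]()
particles.reserveCapacity(particleCount)

for _ in 0..<particleCount {
    // Random positions inside a 1024 x 1024 image, small random velocities
    particles.append(Particle(
        positionX: Float.random(in: 0..<1024),
        positionY: Float.random(in: 0..<1024),
        velocityX: Float.random(in: -1...1),
        velocityY: Float.random(in: -1...1)))
}
```

Because Particle is four 32-bit floats, each particle occupies 16 bytes, which is where the buffer length calculation above comes from.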

Setting up Metal: Threadgroups (7:55)

The next stage is to figure out how to chop the work up so the GPU can handle it. This is done with threadgroups; the thread execution width is the number of threads that are scheduled to run concurrently on the GPU. The first example is a two-dimensional threadgroup with a size of eight by eight, which might be used for image processing. We are more interested in the second, a one-dimensional threadgroup with a width of thirty-two and a height of one.

// Two dimensional thread groups are used for image processing
threadGroupCount = MTLSize(width: 8, height: 8, depth: 1)
threadGroups = MTLSize(width: 1024 / threadGroupCount.width,
     height: 1024 / threadGroupCount.height,
     depth: 1)
// A one dimensional thread group used to pass Metal a one dimensional array
threadGroupCount = MTLSize(width:32, height:1, depth:1)
threadGroups = MTLSize(width:(4096 + 31) / 32, height:1, depth:1)
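
The (4096 + 31) / 32 expression is integer ceiling division: it rounds the number of threadgroups up so that every particle is covered even when the particle count is not an exact multiple of the threadgroup width. A small runnable sketch of that arithmetic:

```swift
// Ceiling division: number of threadgroups needed to cover `count` items
// with `groupWidth` threads per group.
func threadgroupCount(for count: Int, groupWidth: Int) -> Int {
    return (count + groupWidth - 1) / groupWidth
}

let groups = threadgroupCount(for: 4096, groupWidth: 32)    // 128 groups
let oddGroups = threadgroupCount(for: 4100, groupWidth: 32) // 129 groups; the last is partially full
```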

Executing the Kernel Function (8:28)

We now have details of the compute shader and the function we want to run, we’ve added some resources (textures, floats, and values), and defined how we want our workload broken up. It’s time to execute the shader by encoding the dispatch with dispatchThreadgroups. Calling commit causes the command buffer to be executed as soon as possible, and that’s where the work really happens.
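
The dispatch itself isn’t shown above. A minimal sketch, using the same Swift 1.x-era Metal API as the earlier snippets; the waitUntilCompleted call is my addition, one simple way to synchronise before reading results back:

```swift
// Encode the dispatch using the threadgroup sizes defined earlier,
// close the encoder, then submit the command buffer to the GPU
commandEncoder.dispatchThreadgroups(threadGroups,
    threadsPerThreadgroup: threadGroupCount)
commandEncoder.endEncoding()

commandBuffer.commit()
commandBuffer.waitUntilCompleted() // block until the GPU work has finished
```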

Accessing the Arrays in the Shader (8:56)

Here’s an example of a kernel function, particleRendererShader, that will execute once for each of our particles. We also have a uint called id, which is the current index, the position of this thread in the grid. inParticle is a read-only constant buffer where we pass in our source, and we write to outParticle. The const Particle is the particle subscripted by the id, and the last lines update the position of the particle based on its current position and velocity.

kernel void particleRendererShader(
       const device Particle *inParticle [[ buffer(0) ]],
       device Particle *outParticle [[ buffer(1) ]],

       constant float &particleBrightness [[buffer(2)]],

       uint id [[thread_position_in_grid]])
{
    const Particle thisParticle = inParticle[id];

    outParticle[id].positionX =
       thisParticle.positionX + thisParticle.velocityX;

    outParticle[id].positionY =
       thisParticle.positionY + thisParticle.velocityY;
}
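
The position update in the kernel is simple enough to mirror on the CPU, which is handy for sanity-checking the shader’s output. A Swift reference version, in current syntax with helper names of my own:

```swift
struct Particle {
    var positionX: Float, positionY: Float
    var velocityX: Float, velocityY: Float
}

// CPU equivalent of the kernel: one integration step per particle
func step(_ particles: [Particle]) -> [Particle] {
    return particles.map { p in
        var out = p
        out.positionX = p.positionX + p.velocityX
        out.positionY = p.positionY + p.velocityY
        return out
    }
}

let moved = step([Particle(positionX: 10, positionY: 20,
                           velocityX: 1.5, velocityY: -0.5)])
// moved[0] is now at (11.5, 19.5)
```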

Writing to the Output Texture (9:53)

The other thing we need to do is actually draw those particles to a texture. We know the position of our particle as a uint2, a two-dimensional coordinate, and we have a float4 outColor with RGBA values for yellow. We can then write the yellow outColor at that position.

kernel void particleRendererShader(
     texture2d<float, access::write> outTexture [[texture(0)]],
     const device Particle *inParticle [[ buffer(0) ]],
     uint id [[thread_position_in_grid]])
{
  const uint2 particlePosition(inParticle[id].positionX,
    inParticle[id].positionY);

  const float4 outColor(1.0, 1.0, 0.0, 1.0); // RGBA yellow

  outTexture.write(outColor, particlePosition);
}

Accessing the Updated Array in Swift (10:24)

Once the kernel function is completed, the outVectorBuffer is populated with all the updated particles. We need to write that data back to Swift, so we can look at it, or parse it for the next frame. This is done by creating an instance of NSData, and using getBytes to populate the data array with all the updated particles from the shader.

var data = NSData(bytesNoCopy: outVectorBuffer.contents(),
     length: particles.count * sizeof(Particle), freeWhenDone: false)

data.getBytes(&particles, length: particles.count * sizeof(Particle))

Converting the Output Texture to a UIImage (10:49)

In a similar vein, I need to take the texture we wrote previously and, using CGImageCreate with getBytes to provide the pixel data, turn it into a UIImage. It looks static here, but at 25 frames per second, I’ve split the 250,000 particles into Red, Green, and Blue classes to show different effective masses.

Shared Memory (12:15)

This is good, but can it be better? It turns out that the main performance bottleneck in this is not doing all the math, but moving the data between the GPU and CPU. How could we improve that?

There’s a technique that uses memory shared between the GPU and CPU, which means no data is copied. The function posix_memalign allocates memory that both the GPU and CPU can use. There’s a small amount of extra overhead here: some code that uses UnsafeMutablePointers and COpaquePointers to invoke that function. Once this is working, I just copy and paste it as needed. To populate the array of particles, rather than looping over a Swift array as I did previously, I populate the particlesParticleBufferPtr; this time I’m using newBufferWithBytesNoCopy, which creates a buffer from memory that’s already allocated. Importantly, the code inside my kernel shader in Metal hasn’t changed at all; the Metal API is identical.

posix_memalign(&particlesMemory, 0x4000, particlesMemoryByteSize)

particlesVoidPtr = COpaquePointer(particlesMemory)
particlesParticlePtr = UnsafeMutablePointer<Particle>(particlesVoidPtr)
particlesParticleBufferPtr = UnsafeMutableBufferPointer(
      start: particlesParticlePtr, count: particleCount)
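
posix_memalign is a plain C call, so the allocation itself can be sketched outside Metal. Here is a runnable version using the modern pointer API; bindMemory replaces the Swift 1.x COpaquePointer dance above, and the names are illustrative:

```swift
import Foundation

struct Particle {
    var positionX: Float = 0
    var positionY: Float = 0
    var velocityX: Float = 0
    var velocityY: Float = 0
}

let particleCount = 1024
let byteSize = particleCount * MemoryLayout<Particle>.stride

// Allocate 16KB-aligned memory (0x4000, as in the talk); returns 0 on success
var memory: UnsafeMutableRawPointer? = nil
let allocResult = posix_memalign(&memory, 0x4000, byteSize)

// View the raw allocation as a typed buffer of Particles
let particlePtr = memory!.bindMemory(to: Particle.self, capacity: particleCount)
let particleBuffer = UnsafeMutableBufferPointer(start: particlePtr, count: particleCount)

// Populate in place; a Metal buffer created over this memory with
// newBufferWithBytesNoCopy would see these writes without any copying
for i in particleBuffer.indices {
    particleBuffer[i] = Particle(positionX: Float(i), positionY: 0,
                                 velocityX: 0, velocityY: 0)
}
```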

Image Processing Inside the Particle Compute Shader (13:34)

If we want to add things like trails or glows to our particles, or if there are more particles than pixels in the image, we can take that one-dimensional id and turn it into a two-dimensional coordinate using the modulo operator. I loop over adjacent pixels, roughly averaging them together to get a slightly blurry glow effect. The net result is that, by using memory shared between the GPU and CPU, and by having the shader do two things at once, both the particle math and the post-processing, I get up to 2 million particles at 25 frames per second with nice post-processing. As before, there are three classes of particles with different masses.
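
Mapping the one-dimensional id to a two-dimensional pixel coordinate is just modulo and integer division. A small Swift sketch; the function name is mine, and the Metal version does the same with uints:

```swift
let imageSide = 1024

// Map a flat thread id to an (x, y) pixel coordinate in a square image
func coordinate(for id: Int, side: Int) -> (x: Int, y: Int) {
    return (x: id % side, y: id / side)
}

let first = coordinate(for: 0, side: imageSide)      // (0, 0)
let wrapped = coordinate(for: 1025, side: imageSide) // (1, 1): one row down, one pixel in
```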

Advanced Particle Systems (14:45)

For the next step, I looked at more complex particle systems, using a technique called Swarm Chemistry by Hiroki Sayama, who researches collective dynamics and complex systems. I’ve applied his mathematics: we’re not limited to a thread_position_in_grid; we can loop over all the other particles and use their information. For example, we can look at the distance between particles, see if they have similar properties, and determine whether they should repel or attract each other. From there, I’ve developed a bigger shader with 155 lines of code and 4,096 particles. At 16.7 million particle interactions per frame, I’m still getting 30 frames per second, which is impressive. If you’d like to see the results, I’ve released it as part of an open source app called Emergent.
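
A CPU sketch of the “every particle sees every other particle” idea: each particle loops over all the others and, here, simply accumulates an attraction toward neighbours within a radius. This is a toy rule of my own for illustration; Sayama’s actual Swarm Chemistry rules are richer:

```swift
struct Particle {
    var x: Float, y: Float
    var vx: Float = 0, vy: Float = 0
}

// O(n²) pairwise pass: every particle inspects every other particle
func applyAttraction(_ particles: inout [Particle], radius: Float, strength: Float) {
    for i in particles.indices {
        for j in particles.indices where j != i {
            let dx = particles[j].x - particles[i].x
            let dy = particles[j].y - particles[i].y
            let distance = (dx * dx + dy * dy).squareRoot()
            if distance > 0 && distance < radius {
                // Steer toward the neighbour, scaled by strength
                particles[i].vx += strength * dx / distance
                particles[i].vy += strength * dy / distance
            }
        }
    }
}

var pair = [Particle(x: 0, y: 0), Particle(x: 3, y: 4)] // 5 units apart
applyAttraction(&pair, radius: 10, strength: 1)
```

With 4,096 particles this inner loop runs roughly 16.7 million times per frame, which is exactly the workload the compute shader parallelises.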

Acknowledgments (16:23)



Simon Gladman


An early champion of Apple’s Swift language, Simon’s blog flexmonkey.blogspot.co.uk has become a popular source for articles exploring and experimenting with iOS technology. Simon has published several iOS apps including Nodality, a node based image editing app for iPads. In his spare time, his coding takes a more creative direction where his interests include topics such as physics and particle simulations, image processing and novel user interaction patterns.