Running the Baby llama2 Model on iOS with the Illustrate Llama App

Introduction

In this blog post, we’ll walk through the technical steps involved in running the Baby llama2 model from the llama2.c GitHub repository in an iOS app called Illustrate Llama. We’ll cover exporting the model to ONNX format, integrating it into the iOS app, and the challenges we faced along the way.

Exporting the Model to ONNX Format

The Challenge

The first step in our journey was to export the pre-trained Baby llama2 model to ONNX format. However, we hit a roadblock: the llama2.c project did not initially support ONNX export, because the ONNX exporter does not support the Complex64 data type, which the codebase used for its rotary position embeddings.

For more context, see the corresponding issue in the llama2.c GitHub repository.

Our Solution

To overcome this challenge, we submitted a pull request (PR #103) to the llama2.c repository. The core idea behind the PR was to decompose the operations involving complex numbers into separate operations on the real and imaginary parts: since freqs_cis = cos(θ) + i·sin(θ), multiplying x = x_r + i·x_i by it gives (x_r·cos(θ) - x_i·sin(θ)) + i·(x_r·sin(θ) + x_i·cos(θ)), which can be computed entirely with real-valued tensors.

Code Changes for ONNX Export

To resolve the issue with the Complex64 data type, we made several changes to the codebase. Below are some of the key modifications:

Replacing Complex Numbers with Real and Imaginary Parts

Originally, the code used complex numbers for certain calculations. We replaced these with separate real and imaginary parts.

python
# Original code
freqs_cis = torch.polar(torch.ones_like(freqs), freqs)  # complex64

# Modified code
freqs_cos = torch.cos(freqs)  # real part
freqs_sin = torch.sin(freqs)  # imaginary part
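
For context, freqs here comes from the model's rotary-frequency precomputation. Post-PR, that helper looks roughly like the sketch below (modeled on the llama2.c model code; exact details may differ):

python
import torch

def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
    # One rotary frequency per pair of head dimensions
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: dim // 2].float() / dim))
    t = torch.arange(end, device=freqs.device)
    freqs = torch.outer(t, freqs).float()  # (end, dim // 2) matrix of angles
    # Instead of torch.polar (which produces Complex64), keep cos and sin separately
    freqs_cos = torch.cos(freqs)
    freqs_sin = torch.sin(freqs)
    return freqs_cos, freqs_sin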

Modifying the Rotary Embedding Function

The apply_rotary_emb function was modified accordingly: instead of multiplying by the complex freqs_cis, it now rotates the split real parts (xq_r, xk_r) and imaginary parts (xq_i, xk_i) directly.

python
# Original code
xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)

# Modified code
xq_out_r = xq_r * freqs_cos - xq_i * freqs_sin
xq_out_i = xq_r * freqs_sin + xq_i * freqs_cos
xk_out_r = xk_r * freqs_cos - xk_i * freqs_sin
xk_out_i = xk_r * freqs_sin + xk_i * freqs_cos
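
As a sanity check (illustrative only, not part of the PR), you can confirm that the decomposed rotation matches the original complex-valued one; the shapes and names below are assumptions:

python
import torch

B, T, H, D = 1, 8, 2, 16          # batch, seq len, heads, head dim (assumed)
xq = torch.randn(B, T, H, D)
freqs = torch.randn(T, D // 2)    # assumed per-position rotation angles

# Original complex path: last dim holds interleaved (real, imag) pairs
xq_c = torch.view_as_complex(xq.reshape(B, T, H, -1, 2))
freqs_cis = torch.polar(torch.ones_like(freqs), freqs).view(1, T, 1, -1)
out_complex = torch.view_as_real(xq_c * freqs_cis).flatten(3)

# Decomposed real/imaginary path
xq_r, xq_i = xq.reshape(B, T, H, -1, 2).unbind(-1)
cos = torch.cos(freqs).view(1, T, 1, -1)
sin = torch.sin(freqs).view(1, T, 1, -1)
out_r = xq_r * cos - xq_i * sin
out_i = xq_r * sin + xq_i * cos
out_real = torch.stack([out_r, out_i], dim=-1).flatten(3)

assert torch.allclose(out_complex, out_real, atol=1e-5)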

Updating the Forward Method

The forward method in various classes was updated to use the new real and imaginary parts.

python
# Original code
h = layer(h, freqs_cis)

# Modified code
h = layer(h, freqs_cos, freqs_sin)
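
For reference, here is roughly how the model-level forward ends up threading the precomputed tensors through the layers (a sketch modeled on llama2.c, not a verbatim excerpt; it assumes freqs_cos and freqs_sin are stored on the model):

python
# Sketch: freqs_cos / freqs_sin are precomputed once and sliced
# to the current sequence length before being passed to each layer.
def forward(self, tokens: torch.Tensor) -> torch.Tensor:
    seqlen = tokens.shape[1]
    h = self.tok_embeddings(tokens)
    freqs_cos = self.freqs_cos[:seqlen]
    freqs_sin = self.freqs_sin[:seqlen]
    for layer in self.layers:
        h = layer(h, freqs_cos, freqs_sin)
    return self.output(self.norm(h))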

By making these changes, we were able to successfully export the Baby llama2 model to ONNX format without any issues related to the Complex64 data type.

Exporting to ONNX Format Post-PR

The ONNX Export Code

After our pull request was merged, we were able to export the Baby llama2 model to ONNX format. Below is the Python snippet that performs the export:

python
torch.onnx.export(model,
                  torch.from_numpy(input),
                  "./model_128.onnx",
                  verbose=True,
                  input_names=["input"],
                  output_names=["output"])

In this code snippet:

  • model: The pre-trained Baby llama2 model.
  • torch.from_numpy(input): The input tensor converted from a NumPy array.
  • ./model_128.onnx: The path where the exported ONNX model will be saved.
  • verbose=True: Enables verbose output to understand the export process.
  • input_names and output_names: Specify the names of the input and output nodes in the ONNX graph.

By running this code, the model is exported to an ONNX file named model_128.onnx, which can then be integrated into our iOS application.
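
Before wiring the file into the app, it is worth sanity-checking it on the desktop. Here is one way to do that with ONNX Runtime (illustrative, not from the original workflow; the [1, 103] shape and the vocabulary size are assumptions, with the shape mirroring the iOS config further below):

python
import numpy as np
import onnxruntime as ort

# Without dynamic_axes, the exported graph has a fixed input shape,
# so the check must use the same shape as at export time (assumed here).
tokens = np.random.randint(0, 32000, size=(1, 103), dtype=np.int64)

session = ort.InferenceSession("./model_128.onnx", providers=["CPUExecutionProvider"])
(logits,) = session.run(["output"], {"input": tokens})
print(logits.shape)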

Running the ONNX Model on iOS Devices

Utilizing the MPSX Library

To run the ONNX model on iOS, we leveraged a closed-source modification of the MPSX library. MPSX is an excellent open-source project that allows you to load ONNX models on iOS using Swift and perform inference in a straightforward manner.

Enhancements to MPSX

We made extensive enhancements to the MPSX library to support a more comprehensive set of ONNX operators and offer a more flexible way to invoke the model. These modifications enabled us to integrate the ONNX model seamlessly into our iOS application.

Code Snippet: Loading and Running the ONNX Model

Below is a Swift code snippet that demonstrates how to load the ONNX model and perform inference:

swift
func testLlama2_fp32() {
    let path = "model_path"
    let onnxModel = try! OnnxModel(path: path + "/llama2.sim.fp16.onnx")

    let inputConfigs = ["input": OnnxGraphInputConfig(shape: [1, 103], type: .int64)]
    let outputConfigs = ["output": OnnxGraphOutputConfig()]
    let globalConfig = OnnxGraphGlobalConfig(floatPrecision: .float32)
    let graphConfig = OnnxGraphConfig(inputConfigs: inputConfigs,
                                      outputConfigs: outputConfigs,
                                      globalConfig: globalConfig,
                                      gradConfig: nil)

    let graph = try! OnnxGraphBuilder().build(onnxModel: onnxModel, config: graphConfig)
    let input = Tensor.loadFromNpy(path: path + "/input.npy")!

    let output = graph.forward(inputs: ["input": input], outputs: ["output"])["output"]!
}

In this code snippet:

  • OnnxModel: Class for loading the ONNX model.
  • OnnxGraphInputConfig and OnnxGraphOutputConfig: Classes for configuring the input and output shapes and types.
  • OnnxGraphGlobalConfig: Class for setting global configurations like float precision.
  • OnnxGraphBuilder: Class for building the graph for inference.
  • Tensor.loadFromNpy: Method for loading input data from an NPY file.

By running this code, you can perform inference using the ONNX model on your iOS device.
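
For comparison, the autoregressive loop the app drives in Swift can be sketched on the desktop in Python with ONNX Runtime (illustrative only: the BOS token id, the greedy argmax sampling, and a dynamic sequence axis in the export are all assumptions, not details from the original post):

python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("./model_128.onnx", providers=["CPUExecutionProvider"])

tokens = [1]  # assumed BOS token id
for _ in range(64):
    inp = np.asarray([tokens], dtype=np.int64)           # shape (1, seq_len)
    (logits,) = session.run(["output"], {"input": inp})
    next_token = int(np.argmax(logits[0, -1]))           # greedy: take the top logit
    tokens.append(next_token)

print(tokens)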