Can You Run a Large Language Model On iPhone?

Answer:

I’ve been digging into how different iPhone models stack up when it comes to running large language models (LLMs). Here’s a breakdown:

iPhone 12 Series (A14 Chip)

So, the iPhone 12 lineup with the A14 chip (2+4 CPU cores, 4 GPU cores) looks pretty solid for running LLMs like TinyLlama 1.1B. Check out these numbers:

  • Prompt Processing (PP): F16 scores 251.98 t/s, Q8_0 does 250.54 t/s, and Q4_0 gets 242.37 t/s.
  • Token Generation (TG): It’s 10.26 t/s for F16, 24.11 t/s for Q8_0, and 39.21 t/s for Q4_0.

For Phi-2 2.7B, though, the phone can only handle the 4-bit quantized model, generating tokens at 8.52 t/s and processing the prompt at around 51 t/s.
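If you’re wondering why F16 and Q8_0 don’t show up for Phi-2 here, it mostly comes down to RAM. Here’s a back-of-the-envelope Swift sketch; the bits-per-weight figures approximate llama.cpp’s GGUF formats (block scale overhead included), and the 4 GB RAM figure is the iPhone 12’s published spec, not something pulled from the benchmark:

```swift
import Foundation

// Weights-only footprint at a given quantization. Bits-per-weight
// approximates llama.cpp's GGUF types: F16 = 16, Q8_0 ≈ 8.5,
// Q4_0 ≈ 4.5 (block scales included).
func weightsFootprintGB(paramsBillions: Double, bitsPerWeight: Double) -> Double {
    paramsBillions * 1e9 * bitsPerWeight / 8 / 1e9
}

// Phi-2 has ~2.7B parameters; the iPhone 12 ships with 4 GB of RAM
// (published spec), of which an app can realistically use ~2-3 GB.
for (name, bits) in [("F16", 16.0), ("Q8_0", 8.5), ("Q4_0", 4.5)] {
    let gb = weightsFootprintGB(paramsBillions: 2.7, bitsPerWeight: bits)
    print("Phi-2 2.7B @ \(name): ~\(String(format: "%.1f", gb)) GB")
}
// F16 ≈ 5.4 GB and Q8_0 ≈ 2.9 GB blow past the budget,
// leaving the ~1.5 GB Q4_0 build as the only one that loads.
```

The Q4_0 build is the only one that leaves the OS and the app any breathing room, which lines up with what the numbers above show.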

iPhone 13 Pro & Pro Max, iPhone 14 & Plus (A15 Chip, 5 GPU Cores)

Moving on to the iPhone 13 Pro series and the 14s – they’re stepping it up. These are the TinyLlama 1.1B numbers:

  • Prompt Processing (PP): In F16, they hit 531.03 t/s; Q8_0’s at 494.18 t/s; and Q4_0’s at 496.49 t/s.
  • Token Generation (TG): They score 13.66 t/s in F16, 23.84 t/s in Q8_0, and 39.09 t/s in Q4_0.

For Phi-2 2.7B, they do pretty well too, with 120.47 t/s in Q8_0 PP and 16.73 t/s in Q8_0 TG. 

These are pretty good numbers, especially for a 2.7B model.
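One pattern worth noticing across these numbers: quantization barely moves prompt processing, but it nearly triples token generation. PP pushes the prompt through in large batches and is compute-bound, while TG has to read the entire weight set once per generated token, so it’s bound by memory bandwidth. Here’s a rough Swift sketch of that ceiling; the ~34 GB/s bandwidth is a commonly cited estimate for the A15, not an official Apple spec:

```swift
import Foundation

// Naive token-generation ceiling: every generated token reads all
// weights once, so t/s ≈ bandwidth / weights size. Real decoding
// lands below this because of KV-cache traffic and compute overhead.
func tgCeiling(modelGB: Double, bandwidthGBps: Double) -> Double {
    bandwidthGBps / modelGB
}

let a15BandwidthGBps = 34.0  // assumed LPDDR4X estimate, not official

// TinyLlama 1.1B weight sizes at each quantization
// (same footprint math as the earlier sketch).
for (quant, gb) in [("F16", 2.2), ("Q8_0", 1.17), ("Q4_0", 0.62)] {
    let ceiling = tgCeiling(modelGB: gb, bandwidthGBps: a15BandwidthGBps)
    print("TinyLlama \(quant): ceiling ~\(String(format: "%.0f", ceiling)) t/s")
}
// Prints ceilings of roughly 15, 29, and 55 t/s -- the same ~3x
// spread as the measured 13.66 / 23.84 / 39.09 t/s above.
```

The measured numbers land a bit under each ceiling, which is what you’d expect once the KV cache and sampling overhead enter the picture.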

iPhone 14 Pro & Pro Max, iPhone 15 & Plus (A16 Chip)

Now, the A16 chip is a game changer. The TinyLlama 1.1B numbers are:

  • Prompt Processing (PP): Scores are 565.68 t/s in F16, 511.30 t/s in Q8_0, and 505.52 t/s in Q4_0.
  • Token Generation (TG): It’s 20.06 t/s in F16, 34.30 t/s in Q8_0, and 54.24 t/s in Q4_0.

And for Phi-2 2.7B, they’re pretty good too: 119.58 t/s for prompt processing and 14.06 t/s for token generation in 8-bit quantization. The 4-bit quantized model is even better, at 121.64 t/s PP and 23.31 t/s TG.

iPhone 15 Pro & Pro Max (A17 Pro Chip)

The 15 Pro series with the A17 Pro chip is just top-notch and can even run a 7B model.

  • Prompt Processing (PP): TinyLlama 1.1B gets 683.95 t/s in F16, 637.14 t/s in Q8_0, and 646.06 t/s in Q4_0.
  • Token Generation (TG): Here, it’s 20.23 t/s in F16, 35.60 t/s in Q8_0, and 56.86 t/s in Q4_0.

For Phi-2 2.7B, they’re leading with 158.03 t/s in Q8_0 PP and 14.74 t/s in Q8_0 TG. For 4-bit quantization we have 157.33 t/s PP and 24.71 t/s TG.

They can even handle the much larger Mistral 7B (80.55 t/s in Q4_0 PP and 9.01 t/s in Q4_0 TG).
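The same footprint math from earlier shows why 7B is about the practical limit here: at Q4_0, Mistral 7B’s weights come to roughly 4 GB, which fits inside the 15 Pro’s 8 GB of RAM, while F16 (~14.5 GB) is hopeless. A quick sanity check, assuming the published 8 GB spec and a ~7.2B parameter count:

```swift
// Weights-only footprint of Mistral 7B at Q4_0 (~4.5 bits/weight
// in GGUF, block scales included); params and RAM are assumptions.
let mistralParamsB = 7.2                          // ~7.2B parameters
let q4GB = mistralParamsB * 1e9 * 4.5 / 8 / 1e9   // ≈ 4.1 GB of weights
let iPhone15ProRAMGB = 8.0                        // published spec
// Leave headroom for the KV cache, the OS, and the app itself.
print(q4GB <= iPhone15ProRAMGB * 0.6
      ? "Q4_0 Mistral 7B fits with room for the KV cache"
      : "Too tight for comfortable on-device use")
```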

So, how well an iPhone runs LLMs really depends on the model and the chip. The newer ones, especially with the A16 and A17 Pro chips, are impressively efficient and powerful. But remember, as cool as it is to run these models on iPhones, they’re still no match for a Mac or PC!