Inference speed of large language models on different hardware

Apr 18, 2024

At work, we were testing some open-source LLMs, specifically experimenting with a 34B model. The conventional wisdom suggests that expensive GPUs are necessary to run these large models efficiently. It seems like every hardware manufacturer is rushing to integrate some form of AI accelerator into their products: Intel and AMD call them AI engines, Apple labels theirs the Neural Engine, and Snapdragon simply calls it an AI accelerator. The implication is that the computational capabilities of traditional hardware are not sufficient to handle large models. Yet when I tested the dolphin model on a seven-year-old laptop with a 4-core i7-7700HQ CPU, I got essentially the same performance as on a 32-core EPYC 7742: around 0.5 tokens per second. A new laptop, with still far less compute than the 32-core EPYC, almost doubles that, reaching roughly one token per second.

| Platform      | tokens/s | (theoretical) max bandwidth in GB/s |
|---------------|----------|-------------------------------------|
| EPYC 7742     | 0.53     | ???                                 |
| i7-7700HQ     | 0.6      | 37.5                                |
| i9-13980HX    | 1.06     | 89.6                                |
| Mac M3 192GB  | 16.28    | 800                                 |
| A100 40GB     | 26.47    | 1500                                |

As we can see from the table, the performance of large models is entirely dependent on the memory bandwidth between RAM and the compute unit. The computational capabilities, or the speed of the AI accelerator, become basically irrelevant once the models get large enough. This, of course, makes sense: to generate a single token, the entire model has to be read from memory at least once. The dolphin model we used for testing is roughly 40 GB in size, so the absolute maximum a machine with 40 GB/s of memory bandwidth can achieve is one token per second.
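To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch. It assumes the ~40 GB model size mentioned above and the theoretical bandwidth figures from the table; it is only an upper bound, since it ignores everything except streaming the weights once per token.

```python
# Back-of-the-envelope: a decoder-only LLM has to stream every weight from
# memory once per generated token, so memory bandwidth puts a hard ceiling
# on tokens/s, regardless of how fast the compute units are.

MODEL_SIZE_GB = 40  # approximate size of the dolphin model used above

# Theoretical max memory bandwidth in GB/s (values from the table)
bandwidth_gbs = {
    "i7-7700HQ": 37.5,
    "i9-13980HX": 89.6,
    "Mac M3 192GB": 800,
    "A100 40GB": 1500,
}

for platform, bw in bandwidth_gbs.items():
    max_tokens_per_s = bw / MODEL_SIZE_GB
    print(f"{platform:>14}: at most {max_tokens_per_s:5.1f} tokens/s")
```

The measured numbers in the table land well below these ceilings, but they scale with bandwidth in roughly the way the ceiling predicts.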

This also explains why quantization leads to such impressive performance gains on large models, and why the gains on smaller models are much lower: quantization shrinks the amount of data that has to be moved per token, which only matters when memory traffic is the bottleneck. As such, quantization is an unambiguous win on large models and barely worth it on small ones. For example, I went through the ordeal of quantizing the e5 model only to find that the speedup was marginal while the accuracy drop was significant.
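As a rough illustration of that effect, the sketch below scales the same bandwidth ceiling with bytes per weight. The 34B parameter count matches the model tested above, the bandwidth is the i9-13980HX figure from the table, and the bytes-per-weight values are typical figures for fp16, 8-bit, and 4-bit quantization, not measurements.

```python
# Rough sketch: for a bandwidth-bound model, the tokens/s ceiling scales
# inversely with bytes per weight, so quantization translates almost
# directly into throughput on large models.

PARAMS = 34e9          # ~34B parameters, as in the model tested above
BANDWIDTH_GBS = 89.6   # theoretical max bandwidth of the i9-13980HX (table)

for name, bytes_per_weight in [("fp16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    model_size_gb = PARAMS * bytes_per_weight / 1e9
    ceiling = BANDWIDTH_GBS / model_size_gb
    print(f"{name}: ~{model_size_gb:.0f} GB, ceiling ~{ceiling:.1f} tokens/s")
```

A small model that already fits comfortably within the available bandwidth budget sees none of this benefit, which is exactly what I observed with e5.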

Finally, since inference on these large models is entirely bound by memory bandwidth, I would not look for hardware with high computational capabilities, but rather for hardware that offers a good memory-bandwidth-to-cost tradeoff.