When it comes to speed, llama.cpp outperforms ollama by a significant margin, running roughly 1.8 times faster. The gap showed up in a benchmark on the same machine (same GPU) with the same quantized model: ollama managed around 89 tokens per second, while llama.cpp achieved an impressive 161 tokens per second.
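For anyone who wants to sanity-check numbers like these on their own hardware, one rough approach is to stream a completion from each server and time it, since both llama.cpp's llama-server and ollama expose OpenAI-compatible endpoints. The sketch below is only an approximation under assumed defaults: it presumes llama-server is listening on localhost:8080 and ollama on its default port 11434 with comparable quantized models loaded, the model names and prompt are illustrative, and counting streamed chunks only roughly tracks tokens (llama.cpp's own llama-bench tool is the more rigorous option).

```python
# Rough throughput probe for any OpenAI-compatible endpoint. Counting streamed
# content chunks only approximates tokens, so treat the output as a relative
# comparison rather than a precise benchmark. URLs and model names are
# illustrative assumptions, not values from the test described above.
import time
from openai import OpenAI  # pip install openai

def tokens_per_second(base_url: str, model: str, prompt: str, max_tokens: int = 256) -> float:
    client = OpenAI(base_url=base_url, api_key="unused")  # local servers ignore the key
    start = time.time()
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            chunks += 1  # each content delta is roughly one token
    return chunks / (time.time() - start)

if __name__ == "__main__":
    prompt = "Explain the difference between a mutex and a semaphore."
    print(f"llama.cpp: {tokens_per_second('http://localhost:8080/v1', 'local-model', prompt):.1f} tok/s")
    print(f"ollama:    {tokens_per_second('http://localhost:11434/v1', 'llama3', prompt):.1f} tok/s")
```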
This disparity in speed might come as a surprise to some, especially considering that ollama is built on top of llama.cpp. One possible explanation is that the containerization of ollama could be introducing additional overhead, slowing it down. Additionally, llama.cpp is known for its continuous improvements and optimizations, which might not be immediately reflected in ollama's packaged version.
Community insights suggest that compilation methods and precompiled binaries can also impact performance. Precompiled binaries often trade processor-specific optimizations for broader compatibility, which can translate into slower speeds than a build tuned for the local hardware. Moreover, llama.cpp frequently incorporates the latest innovations in model support, inference, and quantization speed, and using it directly gives users finer-grained control over model settings than most wrappers expose.
For those looking to maximize performance, it might be beneficial to bypass wrappers entirely and point OpenAI-compatible plugins and applications straight at llama.cpp, whose built-in server speaks the OpenAI API. This approach not only leverages llama.cpp's speed but also avoids the potential slowdowns associated with containerization. Ultimately, for cutting-edge performance and control, llama.cpp remains the superior choice over ollama.
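As a concrete illustration of that wrapper-free route, the snippet below points the standard openai Python client at a locally running llama-server; compared with talking to a hosted API, only the base_url (and a placeholder API key) change. The port, model name, and prompt are assumptions for the example rather than values from the benchmark above, and the server would first need to be started with something like `llama-server -m model.gguf --port 8080`.

```python
# Minimal sketch: reuse ordinary OpenAI-client code unchanged against llama.cpp's
# built-in server. Assumes llama-server is already running on localhost:8080
# with a quantized model loaded (port and model name are illustrative).
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="http://localhost:8080/v1",  # llama-server's OpenAI-compatible endpoint
    api_key="unused",                     # the local server does not check the key
)

response = client.chat.completions.create(
    model="local-model",  # llama-server serves whichever model it was started with
    messages=[{"role": "user", "content": "Summarize why quantization speeds up inference."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```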