Exciting news for AI enthusiasts! Llama.cpp now supports distributed inference across multiple machines, thanks to the recent integration of rgerganov's RPC code. This update replaces the old MPI code, so you can now split a single model across two or more machines. The feature is still a work in progress and has some limitations, but it opens up new possibilities for putting multiple devices to work on your AI projects.
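If you want to try it, the flow is roughly: build with the RPC backend, start an rpc-server on each remote machine, then point the main binary at those endpoints. Here's a minimal sketch; exact flag and binary names vary between llama.cpp versions, and the addresses, port, and model path below are placeholders:

```sh
# On each worker machine: build with the RPC backend and start an rpc-server.
# (GGML_RPC=ON in recent trees; older builds used LLAMA_RPC=ON. Add your GPU
# backend flag too, e.g. -DGGML_CUDA=ON, if the worker should use its GPU.)
cmake -B build -DGGML_RPC=ON
cmake --build build --config Release
./build/bin/rpc-server -p 50052

# On the machine driving inference: list the workers with --rpc.
# (llama-cli was simply called main in older builds.)
./build/bin/llama-cli -m models/model.gguf \
    --rpc 192.168.1.10:50052,192.168.1.11:50052 \
    -ngl 99 -p "Hello"
```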

One notable limitation is that quantized models are not officially supported yet. There is a workaround, however: ggml-rpc.cpp contains an assertion that fires when a quantized model is loaded, and commenting it out re-enables quantized inference. This tweak requires recompiling, and while it appears to work, there may still be underlying issues, so treat it as experimental. Even so, it lets you experiment with quantized models and potentially improve performance.
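If you want to try the workaround, something like the following is a reasonable approach. The exact assertion and the file's location in the tree differ between llama.cpp versions, so search for it rather than trusting a specific line number:

```sh
# Find the assertion that rejects quantized tensors (file location varies by version):
grep -rn --include=ggml-rpc.cpp "GGML_ASSERT" .
# Comment out that line in your editor, then rebuild with RPC enabled
# (GGML_RPC=ON in recent trees; older builds used LLAMA_RPC=ON):
cmake -B build -DGGML_RPC=ON
cmake --build build --config Release
```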

Network bandwidth is a critical factor in distributed inference. A 1 gigabit Ethernet connection significantly outperforms slower Wi-Fi, and the overall speed is often limited by the slowest machine in your setup. For instance, when running a TinyLlama FP16 model, inference topped out at around 46 tokens per second (t/s) due to network constraints, even though both machines involved (a Mac and a 7900 XTX box) were faster running the model locally.
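Before blaming llama.cpp for slow generation, it can be worth confirming what your link actually delivers. One quick way (using a general-purpose network benchmark, not part of llama.cpp) is a raw throughput test with iperf3; the address below is a placeholder:

```sh
# On one machine, start the iperf3 server:
iperf3 -s
# On the other, measure throughput to it:
iperf3 -c 192.168.1.10
```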

For those looking to maximize performance, consider upgrading your network infrastructure: wired Ethernet is preferable to Wi-Fi, and higher-bandwidth options such as 10 gigabit Ethernet help further. Beyond making distributed inference more accessible, this update also encourages creative uses of idle hardware, such as borrowing the GPU in an old gaming laptop or setting up collaborative AI projects at LAN parties.

In summary, the latest updates to Llama.cpp bring distributed inference capabilities, support for quantized models with a simple tweak, and highlight the importance of network bandwidth. These advancements pave the way for more efficient and collaborative AI endeavors, making it an exciting time for AI developers and enthusiasts alike.