Upgrading to dual RTX 3090 GPUs has significantly boosted my performance when running 4-bit quantized Llama 3 70B models. With a single RTX 3090 I was getting around 2 tokens per second (t/s), but adding a second GPU has dramatically improved my results. Using Ollama I now get approximately 16.95 t/s, and with vLLM the throughput jumps to an impressive 21.2 t/s. That roughly 25% advantage of vLLM over Ollama is particularly welcome given the high electricity prices in Europe.

My current setup includes a Ryzen 7 3700X CPU, an MSI X470 Gaming Plus motherboard, 48 GB of DDR4 RAM, and a single Corsair HX1000 1000 W PSU left over from my old mining days. I opted for Pop!_OS with preinstalled NVIDIA drivers instead of Proxmox, since Proxmox would require a third GPU for video output and passthrough to the VMs. To keep power consumption in check, I set the GPU power limit to 270 W based on insights from the r/LocalLLaMA community.
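
For reference, here is one way to apply that kind of limit with nvidia-smi; treat it as a sketch, since GPU indices and whether you want persistence mode are up to your setup:

sudo nvidia-smi -pm 1            # enable persistence mode so settings stick between driver unloads
sudo nvidia-smi -i 0 -pl 270     # power limit for GPU 0, in watts
sudo nvidia-smi -i 1 -pl 270     # power limit for GPU 1, in watts

Note that the limit resets on reboot, so you may want to wrap it in a small startup script or systemd unit.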

For those looking to replicate this setup, running vLLM across dual RTX 3090s with the AWQ 4-bit quantized Llama 3 70B Instruct model can be done with the following command:

python -m vllm.entrypoints.openai.api_server --model casperhansen/llama-3-70b-instruct-awq -q awq --dtype auto -tp 2 --engine-use-ray --gpu-memory-utilization 0.93
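
Once the server is up, it exposes an OpenAI-compatible API (on port 8000 by default). A quick smoke test with curl might look like this; the prompt and max_tokens are just placeholder values:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "casperhansen/llama-3-70b-instruct-awq", "messages": [{"role": "user", "content": "Explain tensor parallelism in one paragraph."}], "max_tokens": 256}'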

This setup has been running reliably for me, so if you're considering a similar configuration, this command should get you started without hours of digging through documentation.

Lastly, I appreciate the community's support and shared knowledge, especially from r/LocalLLaMA. If you're planning to use this setup for multiple API requests or more intensive tasks, benchmarking under load would be a good next step to ensure it meets your needs. Happy computing!
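
If you want a rough sense of how the server behaves under concurrent load, a crude test can be done with nothing but curl. This is a sketch, not a proper benchmark; the 8 parallel requests, the prompt, and max_tokens are arbitrary choices:

# fire 8 concurrent requests at the server and time the whole batch
time ( for i in $(seq 8); do
  curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "casperhansen/llama-3-70b-instruct-awq", "prompt": "Write a short story about a GPU.", "max_tokens": 256}' \
    -o /dev/null &   # discard the response body; we only care about timing
done; wait )

The vLLM server also prints running prompt and generation throughput in its log, which makes it easy to compare runs at different levels of concurrency.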