After a week of meticulous planning, a couple of weeks waiting on parts from eBay, Amazon, and TitanRig, and days of troubleshooting and BIOS modding, I finally completed my "budget" (<$2500) 96GB VRAM inference rig for Ollama. The goal was to hit that VRAM capacity with P40s while keeping noise levels down. It could have been cheaper, but it was still far less expensive than reaching similar VRAM capacity with newer hardware.
The specs: an ASUS X99-E-10G WS motherboard, an Intel Core i7-6950X CPU, 128GB of RAM, and one Nvidia Quadro P6000 alongside three Nvidia Tesla P40 GPUs, all fed by an EVGA SuperNOVA 1000W power supply. Cooling is a custom liquid loop with EKWB Thermosphere GPU blocks, an EK Quad Scalar dual-slot terminal, and various custom 3D-printed brackets and mounts, which keeps the rig quiet without compromising performance.
Despite the initial BIOS and OS configuration headaches, the build has been a success. At stock settings I'm getting around 7 tokens per second with llama3:70b (Q4) in Ollama, with plenty of VRAM headroom to spare. I'm excited to test larger models and will post updates on performance.
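If you want to reproduce the throughput numbers rather than eyeballing the output, Ollama's REST API reports token counts and timings directly. Here's a minimal Python sketch, assuming Ollama is listening on its default port (11434); swap the model tag and prompt for your own:

```python
# Measure generation throughput via Ollama's REST API.
# Assumes the default endpoint (http://localhost:11434).
import json
import urllib.request

MODEL = "llama3:70b"
PROMPT = "Explain PCIe Resizable BAR in two sentences."

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({"model": MODEL, "prompt": PROMPT, "stream": False}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    r = json.load(resp)

# eval_count = generated tokens; eval_duration is in nanoseconds and
# excludes model load time (reported separately as load_duration).
tps = r["eval_count"] / (r["eval_duration"] / 1e9)
print(f"{r['eval_count']} tokens in {r['eval_duration'] / 1e9:.1f}s -> {tps:.1f} tok/s")
```

Since eval_duration excludes the one-time model load, running this a few times with the same prompt gives a clean steady-state generation figure.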
For those interested in the technical journey, I documented getting all four GPUs running on the X99-E-10G WS in a detailed Level1Techs forum post, including a guide on flashing the motherboard BIOS with ReBAR support, which is crucial for addressing more than 48GB of VRAM across the cards. If you're planning a similar build, this information could save you a lot of time and frustration.
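One quick way to confirm the flash actually took is to look at each card's BAR1 aperture: once large BARs are mapped successfully, the totals should be on the order of the cards' VRAM rather than the legacy 256 MiB window. Below is a rough Python sketch that parses `nvidia-smi -q` output; the field names match typical NVIDIA driver output, but treat the parsing as an assumption and adapt it to your driver version:

```python
# Sanity check for the ReBAR/Above-4G flash: print each GPU's BAR1
# aperture as reported by `nvidia-smi -q`.
import re
import subprocess

out = subprocess.run(
    ["nvidia-smi", "-q"], capture_output=True, text=True, check=True
).stdout

names = re.findall(r"Product Name\s*:\s*(.+)", out)

# Each per-GPU section contains a "BAR1 Memory Usage" block whose first
# "Total" line is the BAR1 aperture size.
for name, chunk in zip(names, out.split("BAR1 Memory Usage")[1:]):
    m = re.search(r"Total\s*:\s*(.+)", chunk)
    total = m.group(1).strip() if m else "unknown"
    print(f"{name.strip()}: BAR1 total = {total}")
```

If a card still reports a tiny BAR1 total, the ReBAR flash or the Above 4G Decoding setting likely didn't stick, and you'll hit the same addressing wall past 48GB.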
In summary, achieving 96GB of VRAM for under $2500 is a remarkable feat, especially considering the cost of newer hardware. This rig not only meets my performance needs but does so quietly and efficiently. If you're considering a similar build, be prepared for some technical challenges, but the results are well worth the effort.