The recent MMLU-Pro all-category test results for Llama 3 70B Instruct have brought some intriguing insights to the forefront. Conducted using Bartowski's GGUFs across a range of quantization levels (IQ2_XXS, Q2_K, Q4_K_M, Q5_K_M, Q6_K, and Q8_0), the tests evaluated model performance across domains such as Business, Law, and Psychology. The results reveal that while higher-precision quants like Q8_0 would be expected to score better, this wasn't always the case.
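To make the setup concrete, here is a minimal sketch of the kind of harness such a test implies: load each GGUF quant, pose multiple-choice questions, and tally correct answers. The file paths, the toy question, and the answer-extraction rule are assumptions for illustration, not the tester's actual pipeline; only the llama-cpp-python API calls are real.

```python
from llama_cpp import Llama

# Hypothetical local paths to Bartowski-style GGUF quants.
QUANTS = {
    "Q4_K_M": "Meta-Llama-3-70B-Instruct-Q4_K_M.gguf",
    "Q8_0":   "Meta-Llama-3-70B-Instruct-Q8_0.gguf",
}

# Toy stand-in for MMLU-Pro items: (question, options, correct letter).
QUESTIONS = [
    ("Which financial statement reports revenues and expenses?",
     ["A) Balance sheet", "B) Income statement",
      "C) Cash flow statement", "D) Statement of equity"],
     "B"),
]

def ask(llm, question, options):
    prompt = question + "\n" + "\n".join(options) + "\nAnswer with a single letter:"
    out = llm(prompt, max_tokens=4, temperature=0.0)
    # Take the first A-J letter in the completion as the model's choice.
    for ch in out["choices"][0]["text"].upper():
        if ch in "ABCDEFGHIJ":
            return ch
    return None

for name, path in QUANTS.items():
    llm = Llama(model_path=path, n_ctx=4096, n_gpu_layers=-1, verbose=False)
    correct = sum(ask(llm, q, opts) == gold for q, opts, gold in QUESTIONS)
    print(f"{name}: {correct}/{len(QUESTIONS)} correct")
```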

In the Business category, for instance, the fp32-Q3_K_M quant led with 55.89% correct answers, while fp16-Q4_K_M came close at 54.12%. This suggests that beyond the very low-bit quants, the gap between quantization levels narrows to the point where a smaller quant can even edge out a larger one. Interestingly, the quants the tester built themselves from the fp32 weights for the untemplated runs also performed strongly, particularly in math-heavy categories.
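A quick sanity check shows how little a gap like 55.89% versus 54.12% can mean. Assuming a category size of several hundred questions (the n below is a hypothetical placeholder, not the actual item count), the binomial standard error puts the 95% confidence interval at roughly plus or minus 3-4 points, wider than the gap itself:

```python
import math

def accuracy_stderr(acc, n):
    """Standard error of an accuracy estimate from n graded questions."""
    return math.sqrt(acc * (1 - acc) / n)

n = 700  # hypothetical category size
for label, acc in [("fp32-Q3_K_M", 0.5589), ("fp16-Q4_K_M", 0.5412)]:
    half_width = 1.96 * accuracy_stderr(acc, n)
    print(f"{label}: {acc:.2%} +/- {half_width:.2%} (95% CI)")
```

Under those assumptions the two quants are statistically indistinguishable on this category alone.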

The Law category followed a similar pattern, with fp32-Q8_0 scoring 44.41%, slightly ahead of fp16-Q4_K_M at 42.78%. This shows that some categories do benefit from higher precision, but the gains are marginal. In Psychology and Health, meanwhile, the results leaned more clearly toward the higher-precision quants, with fp32-Q6_K reaching 76.32% and 70.42% respectively, a visible edge over the lower-bit quants.
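Per-category scores like these come from grading each answer and grouping by domain. The helper below is an illustrative sketch of that tally, with toy records standing in for the real graded data:

```python
from collections import defaultdict

def per_category_accuracy(records):
    """records: iterable of (category, is_correct) pairs from a graded run."""
    tally = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for category, is_correct in records:
        tally[category][0] += int(is_correct)
        tally[category][1] += 1
    return {c: correct / total for c, (correct, total) in tally.items()}

# Toy graded results, not the actual test data.
records = [("Law", True), ("Law", False), ("Psychology", True), ("Psychology", True)]
for category, acc in sorted(per_category_accuracy(records).items()):
    print(f"{category}: {acc:.2%}")
```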

An unexpected takeaway was the performance disparity between hardware setups. The Nvidia H100s did not significantly outperform the 4090s for single-user tasks, though they excelled at handling parallel requests. The Nvidia setups were generally 50-100% faster than the M2 Ultra Mac Studio and roughly 300% faster than the M2 Max MacBook Pro, a clear advantage in raw computational power.

These results underscore that while higher-precision quants often yield better scores, the differences are frequently small and vary with the test category. That makes Q4_K_M a reliable middle ground for many applications, balancing accuracy against memory and compute cost. As testing continues, especially with models like WizardLM 8x22B on the horizon, further runs will help refine these findings, making the choice of model and quantization level more data-driven.
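The memory side of that trade-off can be estimated from approximate bits-per-weight figures for common llama.cpp quants (the values below are rough community ballparks, and real GGUF files vary slightly by tensor layout). For a 70B model, Q4_K_M lands near 40 GiB of weights versus roughly 70 GiB for Q8_0, which is much of why it serves as the middle ground:

```python
# Approximate bits per weight for common llama.cpp quant types.
BITS_PER_WEIGHT = {"Q2_K": 3.35, "Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}
PARAMS = 70e9  # Llama 3 70B parameter count

for quant, bpw in BITS_PER_WEIGHT.items():
    gib = PARAMS * bpw / 8 / 2**30
    print(f"{quant}: ~{gib:.0f} GiB of weights")
```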