In a recent discussion, Jensen Huang, CEO of NVIDIA, seemingly hinted at some intriguing details about GPT-4, suggesting it might be a 1.8 trillion parameter model trained on 8 trillion tokens. The remark has sparked a flurry of speculation and excitement among AI enthusiasts and experts alike. While the 1.8T MoE (Mixture of Experts) theory has been circulating for a while, the mention of an 8 trillion token training dataset is relatively new and remains unconfirmed.
One of the top comments on the Reddit thread points out a subtle but important distinction in Jensen's wording: he said "train a..." rather than "train the...". This could imply that he was speaking hypothetically, offering an example based on scaling principles rather than confirming specific details about GPT-4. That nuance leaves plenty of room for interpretation about the exact nature of GPT-4's architecture and training data.
Another interesting observation from the discussion is the comparison to Llama 3, whose 8 billion parameter model was reportedly trained on 15 trillion tokens. The contrast is striking: a far smaller model trained on nearly twice as much data implies a dramatically higher tokens-per-parameter ratio, as the rough comparison below shows. The sheer volume of data and compute involved in either case underscores the resources needed to push the boundaries of AI capabilities.
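For a sense of how different these two data-to-model-size trade-offs are, here is a back-of-envelope calculation using the rumored and reported figures above. None of these numbers are officially confirmed, so treat this as a sketch of the ratios people in the thread were comparing rather than a statement about either model's actual training recipe.

```python
# Back-of-envelope tokens-per-parameter comparison.
# All figures are rumors/reports from the discussion, not confirmed specs.

gpt4_params = 1.8e12      # rumored total parameters (MoE, so active params would be lower)
gpt4_tokens = 8e12        # rumored training tokens

llama3_8b_params = 8e9    # Llama 3 8B parameter count
llama3_8b_tokens = 15e12  # reported training tokens

print(f"GPT-4 (rumored): {gpt4_tokens / gpt4_params:.1f} tokens per parameter")
print(f"Llama 3 8B:      {llama3_8b_tokens / llama3_8b_params:.0f} tokens per parameter")

# Prints roughly 4.4 tokens/param for the rumored GPT-4 figures versus
# about 1875 tokens/param for Llama 3 8B: two very different points on the
# model-size vs. dataset-size trade-off.
```

Keep in mind that for an MoE model the total parameter count overstates per-token compute, since only a subset of experts is active for any given token, so the ratio is illustrative rather than a like-for-like efficiency comparison.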
As the AI community eagerly awaits more concrete information, it's clear that any confirmation about GPT-4's parameters and training data will have significant implications for the future of AI development. Whether Jensen's comments were a slip or a strategic hint, they have certainly reignited interest and debate about the next generation of AI models.