GPT-4 was trained on about 500 billion words – essentially all good-quality, publicly available text. The performance of deep learning models generally improves as model complexity and the amount of training data increase. This raises the question of how further improvements can be achieved, since we have almost run out of new training data for language models. However, multimodal models open up enormous new reserves of training data in the form of images, audio and video. AI models such as Gemini, which can be trained directly on all of this data, are likely to have much greater capabilities going forward. For example, I would expect models trained on video to develop sophisticated internal representations of what is called “naïve physics”: the basic understanding that humans and animals have of causality, movement, gravity and other physical phenomena.