Monday, July 7, 2025

What is multimodal AI? - McKinsey

Multimodal AI is a type of artificial intelligence that can understand and process different types of information, such as text, images, audio, and video, all at the same time. Multimodal gen AI models produce outputs based on these various inputs. Multimodal models mirror the brain’s ability to combine sensory inputs for a nuanced, holistic understanding of the world, much like how humans use their variety of senses to perceive reality. These gen AI models’ ability to seamlessly perceive multiple inputs—and simultaneously generate output—allows them to interact with the world in innovative, transformative ways and represents a significant advancement in AI. By combining the strengths of different types of content (including text, images, audio, and video) from different sources, multimodal gen AI models can understand data in a more comprehensive way, which enables them to process more complex inquiries that result in fewer hallucinations (inaccurate or misleading outputs).