‘Visual’ AI models might not see anything at all
A study reveals that AI models like GPT-4o and Gemini 1.5 Pro struggle with simple visual tasks, calling into question claims about their 'vision' capabilities.
A study by researchers at Auburn University and the University of Alberta investigates the visual understanding of multimodal AI models such as GPT-4o and Gemini 1.5 Pro. Their experiments used basic tasks, such as determining whether two shapes overlap or counting specific objects, that a first-grader could handle with ease.
The results showed that these models performed poorly even on such straightforward tasks. Asked whether two circles overlapped, GPT-4o answered correctly only 18% of the time when the circles were close together, and Gemini 1.5 Pro managed just 70% accuracy.
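For context, the ground truth for the overlap task is trivial to compute: two circles overlap exactly when the distance between their centers is no greater than the sum of their radii. A minimal sketch of that check (the function name and test values are illustrative, not taken from the study):

```python
import math

def circles_overlap(center1, radius1, center2, radius2):
    """Return True if two circles overlap (touching boundaries count)."""
    dx = center1[0] - center2[0]
    dy = center1[1] - center2[1]
    # Overlap iff center-to-center distance <= sum of radii.
    return math.hypot(dx, dy) <= radius1 + radius2

# Two unit circles with centers 1.5 apart clearly overlap.
print(circles_overlap((0, 0), 1, (1.5, 0), 1))  # True
# Move the second circle farther away and they separate.
print(circles_overlap((0, 0), 1, (3.0, 0), 1))  # False
```

The tasks in the study are of exactly this character: unambiguous questions with a one-line geometric answer, which is what makes the models' failure rates notable.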
The researchers suggest that these models lack genuine visual understanding and instead match inputs against patterns in their training data. This showed in the ring-counting task: the models handled five interlocking rings, a pattern familiar from the Olympic logo, but became inconsistent with other numbers of rings, indicating the absence of true visual judgment.