A recent IEEE Spectrum review of AI systems summarizes the results of several papers: models that breeze through quizzes and describe images without trouble get stuck on a simple clock with two hands. Even when they recognize the dial, they often add up the angles incorrectly or swap the hour and minute hands, producing the wrong time. The authors point out that this makes a good “X-ray” of the limitations of today’s systems, because the task requires both precise vision and basic spatial reasoning.
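The arithmetic behind reading a clock is straightforward, which is what makes the failure striking. A minimal sketch of the mapping the models get wrong (the function name and the 5° tolerance are illustrative assumptions; angles are measured in degrees, clockwise from 12):

```python
def read_clock(hour_angle: float, minute_angle: float) -> tuple[int, int]:
    """Recover (hours, minutes) from the two hand angles."""
    minutes = round(minute_angle / 6) % 60   # minute hand advances 6 deg per minute
    hours = int(hour_angle // 30) % 12       # hour hand advances 30 deg per hour

    # The hour hand also creeps 0.5 deg per minute, so the two angles
    # must be mutually consistent; swapped hands usually violate this.
    expected = (hours * 30 + minutes * 0.5) % 360
    if abs((hour_angle - expected + 180) % 360 - 180) > 5:
        raise ValueError("hand angles inconsistent - possibly swapped")
    return hours, minutes
```

For example, `read_clock(105.0, 180.0)` yields 3:30, while feeding the same angles in swapped order trips the consistency check, which is exactly the confusion the papers describe.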
One of the new tests is ClockBench: 180 clocks and 720 questions, ranging from classic numerals to Roman numerals and stylized hands. Untrained people achieve about 89% accuracy, while top models lag far behind. The researchers conclude that even “thinking in steps” does not help when the visual perception itself is uncertain: small shifts in hand position and unusual dial designs are enough to ruin the result.
A team from the University of Edinburgh reports similar findings: the models often misread the positions of the hands, and when the task is extended to calendars, the errors multiply. The conclusion is that current systems guess patterns more than they “understand” the rules of geometry and time, which makes them sensitive to details that people overlook effortlessly.
One paper specifically analyzed GPT-4.1 and showed that the results can be improved by targeted fine-tuning, but even then the task remains sensitive to distorted numerals and atypical hands. In other words, “setting the clock” is not yet a solved problem for AI; training can only mitigate it to some extent.
Such tests are a useful reminder that models are not a universally reliable “eye and brain” but a patchwork of abilities with holes. If the AI in your application needs to interpret instruments, numbers, calendars, or diagrams, it should be specifically trained and validated for that task, with human oversight kept in place.