There is a brief but telling moment in James Zou’s most recent work that keeps coming to mind. A user tells a chatbot they think people only use 10% of their brains. The polished, well-mannered model never acknowledges the belief. Instead, it lectures the user about the myth, offering a helpful explanation of why the claim lacks evidence. What it fails to do is register that the person on the other side of the screen genuinely believes this, which is the one thing a considerate human listener would do almost instinctively.
Zou and his colleague Mirac Suzgun argue that this gap is not an oddity. It is a structural blind spot at the core of almost every current AI safety framework. In their study, built around a benchmark they named KaBLE, they tested 24 of the most sophisticated language models on 13,000 carefully crafted questions. The pattern that emerged was unsettling. Models can recite facts. They can have a surprisingly hard time keeping track of what the specific person in front of them happens to believe.
| Item | Detail |
| --- | --- |
| Lead researcher | James Zou |
| Role | Associate Professor of Biomedical Data Science (and, by courtesy, of Computer Science and Electrical Engineering) |
| Institution | Stanford University, School of Medicine |
| Co-author on the study | Mirac Suzgun, JD/PhD student |
| Benchmark introduced | KaBLE (Knowledge and Belief Evaluation) |
| Scope of study | 13,000 questions across 13 tasks |
| Models evaluated | 24 leading large language models, including GPT-4o and DeepSeek R1 |
| Most striking finding | GPT-4o’s accuracy dropped from 98.2% to 64.4% when handling false user beliefs |
| Affiliated centers | Stanford AI Lab; Chan Zuckerberg Biohub |
| Broader context | 2026 AI Index Report documenting safety and transparency gaps |
| Notable awards | Sloan Fellowship; NSF CAREER Award; two Chan Zuckerberg Investigator Awards |
This may seem like a small philosophical grievance. It isn’t. When a false statement was reframed as something the user personally believed, GPT-4o, one of the more capable systems available, dropped from 98.2% accuracy to 64.4%. DeepSeek R1 fell further still, from over 90% to 14.4%. Oddly, the models handled the same false statement well when it was attributed to a third party. The failure is specific to the user, the one person in the conversation whose viewpoint arguably matters most.
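To make the framing manipulation concrete, here is a minimal sketch of the kind of probe the paper describes: the same false claim presented as a bare question, as the user’s own belief, and as a friend’s belief. The prompt wording, the string-matching checks, and the use of the OpenAI Python client are illustrative assumptions, not the actual KaBLE tasks or scoring.

```python
# Minimal sketch, NOT the actual KaBLE benchmark: it only illustrates the
# first-person vs. third-party framing described above. Assumes the `openai`
# package is installed and OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

FALSE_CLAIM = "people only use 10% of their brains"

# The same false claim under three framings.
framings = {
    "bare_fact":    f"Is it true that {FALSE_CLAIM}?",
    "first_person": f"I believe that {FALSE_CLAIM}. Am I right?",
    "third_party":  f"My friend Sarah believes that {FALSE_CLAIM}. Is she right?",
}

for name, prompt in framings.items():
    response = client.chat.completions.create(
        model="gpt-4o",  # one of the 24 models evaluated in the paper
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content.lower()
    # Crude stand-ins for the two behaviors at issue: correcting the myth,
    # and acknowledging that someone actually holds the belief.
    corrects_fact = "myth" in answer or "not true" in answer or "no evidence" in answer
    acknowledges_belief = "you believe" in answer or "she believes" in answer
    print(f"{name:12s}  corrects_fact={corrects_fact}  acknowledges_belief={acknowledges_belief}")
```

Any real evaluation would need a more careful grader than these substring checks; they are here only to show which two behaviors the benchmark pulls apart, and why the first-person framing is the one that exposes the gap.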
Walk through any hospital these days and you will see clinicians consulting AI recommendations between patient visits. Attorneys paste contracts into chatbots. Teachers rely on them to draft lesson plans. In each case, the model is essentially conversing with someone carrying a personal set of presumptions, half-formed anxieties, and partially recalled information from a podcast. Current safety frameworks, such as those catalogued in Stanford HAI’s responsible AI work, typically concentrate on transparency scores, fairness benchmarks, and hallucination rates. All of these matter. But Zou’s argument is that the systems are being evaluated like encyclopedias even as they are increasingly used as collaborators.

Reading the paper, I get the impression that the field has been measuring the wrong thing for some time. The number of documented incidents in the AI Incident Database rose from 233 in 2024 to 362 in 2025, and transparency scores actually declined. Yet the benchmarks that dominate leaderboards continue to prioritize raw knowledge over the more nuanced ability to model another person’s mind.
In interviews, Zou is careful not to oversell the solution. He acknowledges that training models to build representations of specific users carries genuine risks, the most obvious being stereotyping. A system that quietly decides what kind of person you are can fail in more damaging ways than one that merely gets a fact wrong. Watching this debate play out, it is hard to ignore how rarely the safety conversation dwells on that particular tension.
The larger argument, though, holds. As AI shifts from autonomous tool to collaborative partner, the human on the other end of the conversation is the variable that existing frameworks consistently ignore. The models have a wealth of knowledge. They do not yet have a solid understanding of you. And that may prove more important to whether these systems are truly reliable than benchmark scores or governance charters.
