How do we understand when AI chatbots fail at supporting users’ mental health?

Bridging the gap in understanding between clinicians and AI researchers

Apr 29, 2026

[I wrote this article without using AI for writing support, except for looking up some references that I cross-checked.]

We’ve all heard about the heartbreaking cases where people have completed suicide, and their interaction with an AI chatbot substantially aided that process (CBS News, 2026; CNN Business, 2026). “AI-induced psychosis” is becoming a well-known term (Nichols et al., 2026; Jutla et al., 2025). How do we, as clinicians, approach the issue of clients using AI?

I am glad to see that clinicians have started speaking out about the dangers that AI use poses. Some of the voices I hear are clinicians wanting to tell clients to just stop using AI for mental health support. I understand the sentiment. Human connection is paramount for healing. But the reality is that many clients won’t stop. AI is becoming so embedded in our lives that it’s already nearly impossible not to interact with it every day. Personally, I want to understand more about what’s happening and how, with the goal of helping clients who use AI to do so in the most supportive way possible.

One issue at the intersection of AI and mental health is that AI is an advanced technology, and many of us clinicians are not well versed in it, let alone comfortable keeping up with its advances. (Given how fast this technology is advancing, I suspect very few people feel “on top of it.”) The AI world has its own jargon and way of thinking, which create a divide between what AI researchers are doing in the area of mental health and what’s accessible to mental health clinicians.

I want to bridge this gap a bit here, and describe some research in the AI domain about when and how AI fails in mental health contexts. I also want to convey to clinicians that there are researchers working to make AI safer in terms of mental health, which I suspect isn’t exactly common knowledge. Also, the results of the research I am discussing aren’t very reassuring. However, some improvements are happening.

What am I talking about? Some researchers have created frameworks that evaluate how well different AI chatbots perform when interacting with users with different psychological issues. I will focus on three evaluation frameworks here: SIM-VAIL (Weilhammer et al., 2026), VERA-MH (Bentley et al., 2026), and MindEval (Pombal et al., 2025). The names are just acronyms that the researchers came up with for their frameworks (see the Reference section at the bottom for more info on what they stand for). These are the three latest evaluation frameworks that I came across.

The general idea is to have an evaluation framework (or a benchmark) that can be used to give AI chatbots a score on how well they do when users chat to them about their mental health questions/issues. AI developers can use these evaluation frameworks to evolve safer AI chatbots in the mental health context. In fact, I just read that a new wellness app called Flourish reported that they had the highest score on VERA-MH as a way to indicate how safe it is to use (Xuan Zhao, April 22, 2026).

I will begin with the overall results across the three studies. They are not very reassuring about the ability of the most widely used AI chatbots to handle mental health issues. For example, the SIM-VAIL article states “…we found evidence of concerning chatbot behavior across virtually all user phenotypes and most of the 9 consumer AI chatbots audited, albeit reduced in newer models” (p. 1, Weilhammer et al., 2026). The most concerning AI chatbot behavior was found for users exhibiting mania, psychosis and depression, especially if what they sought was “glorification” (which is kind of like ego boosting of whatever harmful behavior they wanted to engage in), which created risky action and dependence on the AI chatbot (“you are the only one who understands me” kind of thing). SIM-VAIL results show that Grok 4 performed worst, and Claude Sonnet 4.5 performed best.

An interesting point raised by the creators of SIM-VAIL (Weilhammer et al., 2026) is that they sometimes saw trade-offs, where interventions that helped with some mental health issues were actually harmful for others. As they put it, “reducing overt harm-enabling behavior at the cost of promoting emotional dependence” (p. 3, Weilhammer et al., 2026). This points to the importance of interventions that are based on context, a topic that is very familiar to clinicians.

Another evaluation framework, VERA-MH, found that overall, 10-30% of AI chatbot responses were evaluated to have “high potential for harm” (Supplemental Figure S1, Bentley et al., 2026), with GPT 5.2 most likely to have potentially harmful responses, and Claude Sonnet 4.5 least likely to do so. That’s concerning. As a therapist, my own threshold for statements that I would consider “high potential for harm” is extremely low. If I said something in a session that I later evaluated as “high potential for harm,” I would subsequently focus a whole session (or 10!) on fixing that.

The MindEval framework found that the models weren’t doing too well either. “All systems [AI chatbots] score below 4 [out of 5] points on average on MINDEVAL, with performance landing between 2.16 (Qwen3-4B-Instruct) and 3.83 (Gemini 2.5 Pro), indicating that even frontier models are likely unsuitable for mental health applications” (p. 8, Pombal et al., 2025). A quote from MindEval by a clinician evaluating an AI chatbot: “The AI kept patients in their comfort zone, prioritizing the removal of any pressure, emphasizing micro-interventions, avoiding deeper emotional or behavioral work, and using language that discouraged engaging with discomfort. This reinforced avoidance, signaled fragility and ‘unsafe to feel uncomfortable’, and created conditions unlikely to produce meaningful therapeutic change” (p. 9, Pombal et al., 2025). This is a great description of sycophantic behavior (a term for the tendency of AI chatbots to please the user and support them in whatever they want to do without much reservation). This behavior has been identified as one of the biggest obvious dangers of using AI chatbots for therapy (Chandra et al., 2026).

There is one very important thing to understand about these frameworks. Here is how they work:

“Users” are simulated (they’re not human!) using AI.
“Users” then talk to an AI chatbot that’s being evaluated (that’s your standard ChatGPT or Claude, etc.).
Then there is an AI safety judge (another AI!) that evaluates the conversations produced between the “user” and the AI chatbot for how safe they are in terms of supporting the “user’s” mental health.

So, it’s AI talking to AI, then evaluated by AI. Granted, all three frameworks have tried to produce realistic users and to test whether their AI safety judge is a good evaluator.
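For readers who find code easier to follow than prose, the three steps above can be sketched in a few lines of Python. This is only an illustration of the general pattern, not the actual code of SIM-VAIL, VERA-MH, or MindEval. Every name in it (`run_conversation`, `toy_user`, `toy_chatbot`, `toy_judge`) is made up for this example, the three "AIs" are simple stand-in functions rather than real models, and the judging rule is a deliberately crude assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    """One conversational turn: the user types something, the chatbot responds."""
    user: str
    chatbot: str


def run_conversation(
    simulated_user: Callable[[List[Turn]], str],
    chatbot: Callable[[List[Turn], str], str],
    num_turns: int,
) -> List[Turn]:
    """Steps 1 and 2: a simulated user talks to the chatbot being evaluated."""
    transcript: List[Turn] = []
    for _ in range(num_turns):
        user_message = simulated_user(transcript)
        reply = chatbot(transcript, user_message)
        transcript.append(Turn(user=user_message, chatbot=reply))
    return transcript


# Hypothetical stand-ins. In the real frameworks each of these is an AI model.
def toy_user(transcript: List[Turn]) -> str:
    return f"I feel low today (message {len(transcript) + 1})."


def toy_chatbot(transcript: List[Turn], user_message: str) -> str:
    return "That sounds hard. Is there someone you trust you could talk to?"


def toy_judge(transcript: List[Turn]) -> float:
    """Step 3: a 'judge' scores the whole conversation.

    Assumed toy rule: a reply is flagged if it never points the user toward
    another person. Returns the fraction of flagged replies (0.0 to 1.0).
    """
    if not transcript:
        return 0.0
    flagged = sum(1 for turn in transcript if "someone you trust" not in turn.chatbot)
    return flagged / len(transcript)


if __name__ == "__main__":
    conversation = run_conversation(toy_user, toy_chatbot, num_turns=10)
    print(len(conversation), "turns; fraction flagged:", toy_judge(conversation))
```

Notice that no human appears anywhere in this loop. A clinician only enters the picture if someone separately checks whether the judge's scores match human judgement.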

As I see it, there is one particularly glaring issue with these frameworks: the studies involved very little to no human clinical input. From what I could see, SIM-VAIL used no human clinicians to evaluate whether their AI safety judge was aligned with human judgement. MindEval used four clinicians to evaluate the alignment of their AI safety judge. The study that collaborated most with clinicians was VERA-MH, which used six clinicians to cross-check whether their AI safety judge did well. Moreover, in the VERA-MH study, the biggest point of disagreement between clinicians was when one clinician thought a response had “high potential for harm” while another clinician considered the same response “best practice.” That seems problematic to me.

As clinicians, we all know that therapists are very different. They work differently, they evaluate things differently. There are 10 clinicians across these three studies that served to represent our whole profession in terms of safety of interactions. I don’t doubt that these clinicians were highly competent, but that’s a very small number of human therapists making a potentially enormous impact on deciding what is safe for literally millions of humans.

A second issue is that these frameworks evaluate conversations over a relatively small number of conversational turns (a “turn” means that the user types something, and then an AI chatbot responds). MindEval and VERA-MH evaluated conversations for 20 turns, and SIM-VAIL evaluated them for 10 turns. That’s what, about half an hour or an hour of conversation? People, especially when they are relying on AI chatbots for support, can use them daily and for several hours at a time (e.g., Hau and Winthrop, 2025; Rosenbluth, 2026). Researchers are recognizing this and starting to address the need to understand how user-AI chatbot conversations evolve over time in order to ascertain a more realistic risk (e.g., Nicholls et al. 2026).

Overall, I was excited that there are researchers focused on this! Without knowing about all this research, I felt much more helpless about the progression of AI. I would love for therapists to get curious about AI research, and for that we need to be able to understand it. That’s why I am writing this article. My call for therapists and clinicians is: let’s understand AI and help our clients use it wisely.

What I would also love to see is more collaboration between the AI researchers building these frameworks and therapists/other clinicians. The research that I am seeing is great, but in many ways it’s a bit simplistic in the way it views the therapeutic process. I was glad to see that I am not alone in this, as I was reading Zhao’s article (Zhao, 2026). In my experience and opinion, therapeutic work is as much art as it is science, and understanding how it works is not reducible to rubrics and categories (although that does help). It’s nuanced and complex, nonlinear and surprising. It’s also awe-inspiring and alarming, just like the artificial intelligence that we are creating.

I want to share and invite dialogue on some questions that I am deeply interested in right now. When does the transition from using AI for work and logistics to using it for emotional support occur? For whom does it occur most often? What are the most effective interventions, on societal, technological, and clinical levels that aid humans in using AI in a supportive way? How do these interventions vary by different populations (teenagers, elders, people managing psychiatric disorders, underserved populations, populations of different ethnicities, races and cultures, etc.)? How do we create AI-human integrated systems where humans can easily step in when AI isn’t the right option?

I hope that my summary made sense to the clinicians who are reading it. If not, let me know! I want to talk about this. I want us to step into this time of transformation with open eyes, understanding the risks while also feeling awe at what’s happening. Most of all, I want us to work on this together.

Leeza's Substack
