> They performed shockwave therapy on my shoulder even though a recent clinical practice guideline says clinicians should not use or recommend shockwave therapy for rotator-cuff tendinopathy without calcification; I was told during ultrasound that there was no calcification.
Ultrasound isn't a great way to assess for calcification. It'll find large calcifications but easily miss small ones. A plain radiograph would be more helpful, but the MRI may have revealed it as well. Either way, shockwave therapy isn't harmful in the absence of calcification--it's just not helpful.
Edit: when a radiology report says something isn't present, there's always an implicit caveat that the finding isn't present within the context of the modality and images obtained. So an ultrasound report can state there are no calcifications while a plain radiograph can report the presence of calcifications without being inconsistent. Obviously very confusing to patients and people unfamiliar with medical jargon, but clarifying this in reports would make them sound even more qualified, "hedgey", and annoying to read than they already are.
There are other commenters saying this is a good practice they've also done for other injuries. You are saying you are an actual radiologist and immediately clock the problems with its advice.
I have seen this pattern over and over again. Anytime someone is an actual expert at anything, AI output appears insufficient or incomplete or outright misleading. It is only when you do not know the subject the AI is being asked about that you are likely to find the output helpful.
This is itself alarming to me, but no one else seems to find it quite damning for the AI services being offered, preferring instead to be wowed by the convenience and speed with which unreviewed and unproven information can be delivered to them.
It is weirdly religious in a way, because if you were to present contrary evidence (e.g. experts in a field weighing in about how plausible sounding responses are bunk), you would only be told you don’t believe enough in the long term potential and capabilities.
Don’t get me wrong, I think we all agree capabilities will eventually improve (and farther-future capabilities could reasonably surpass experts), but it really is unclear if the current transformer architectures with their probabilistic/hallucinatory outputs will plateau before they surpass current experts' abilities in all promised fields.
A lot of the models up to this point have benefited - like Google did - from an essentially ‘pre SEO’ internet.
Now the same tools are being used to generate nigh infinite good sounding bullshit, which poisons the dataset in all sorts of hard to detect ways.
To add insult to injury, the human experts are also not as naive, and have many incentives to poison their own input in subtle ways too.
OpenEvidence: more than 40% of U.S. physicians use it daily, and it handled around 20 million clinical consultations per month.
Over 100 million Americans were treated by a doctor using it in 2025.
This point is being raised in literally all discussions about llms for the whole last year, if not longer.
What it omits is the fact that these people getting suckered into the ai psychosis are using non-specialized models without an agentic loop while knowing nothing about the topic they're using the ai for.
That's down to the fact that this tech hasn't really been integrated yet and people are using it widely (and) irresponsibly, but it's not necessarily something you should blame LLMs for - the cause is likely more down to the model providers' marketing and our collective tendency to like self-affirmation / to think we ourselves know best.
We've known since the beginning that AIs confidently say incorrect things. But now that they can speak confidently about very complex topics, and mostly say correct things, we are letting our guard down and lots of subtle falsehoods are slipping through.
*In one case, I was able to put things back on track because the AI suggested my colleague talk to me; somehow it figured out we were co-workers.
Absolutely agree. Have seen this first hand
Yes, this is exactly so. AI is able to confidently sound plausible enough to convince laypersons or anyone who isn't very familiar with the subject matter, which is a big part of the mass-appeal "magic" of ChatGPT and other similar tools. It's like having a know-it-all friend (who also makes shit up to bridge their own knowledge gaps).
In many non-advanced non-specialized situations, AI is right enough to be at best useful or at worst not harmful (usually landing in the middle somewhere).
But speaking for myself, in areas where I consider myself quite proficient, I can very easily spot the subtle inconsistencies and naive conclusions that AI responses provide, and I have to guide/steer/correct it a lot to get good results when the subject matter is complex enough.
The LLM may have, from its "perspective", implicitly thought the OP was telling it that he had strong reason to believe there was no calcification and was not considering the bigger picture of possibly receiving an incomplete/poor assessment from the medical staff. In fact, the issue here may be the LLM overly trusting doctors vs. trusting its own expertise.
Similarly with LLMs, you can't just write them off entirely because they sometimes provide misleading or incorrect advice. The positive utility maximizing view is to learn when you need to call in an expert. I recently moved in to a new house and have used Claude extensively to figure out basic things (e.g., adjusting the garage door height, how to mount a TV). However, when the HVAC suddenly stopped working, I gave Claude a shot for an hour and tried some non-destructive fixes, but then realized I had to call in an HVAC expert.
I find Claude is surprisingly similar to a confident but incorrect coworker, with the benefit that Claude will reevaluate when I correct it.
"Be wowed by the convenience and speed", or merely "take advantage of the mere availability"? What most people find to be damning about expert advice is that they simply can't get it anywhere, at any cost that they can afford.
Who do you choose to be coached by an expert on the ground?
The first: Has no clue about anything and therefore no useful knowledge and cannot challenge me
The second one: He is known to give wrong information and make basic mistakes.
The LLM will do its best, even if imperfect, since it summarizes what it read in books.
I prefer to be grounded in what the Airbus / Boeing manuals or pilot training books say than in two far more unreliable sources.
Ok for pain in your shoulder it might not, but how about a woman with a lump in her breast waiting for the mammogram interpretation? How about someone trying to understand disturbing lab results? People are also often pushed these days to move through visits with doctors at a breakneck speed, but the AI will "hear you out" all day.
Part of this is a problem with the AI, part of it a problem with our healthcare systems, and part of it is simply human nature. If you think that OpenAI, Anthropic, Google and the rest weren't aware of this going in you must have very little faith in the intelligence of their members. It's not hard to imagine the future of LLMs should involve a hell of a lot of liability on the companies running them, but for now it's the Wild West.
Whatever scenario you come up with my answer is the same.
As an adult I’d like to be able to choose what tools I use to learn about my condition regardless of how well it works or even if it’s likely to mislead me.
There’s risk in every aspect of life and we can’t baby proof everything.
Even if it "works" so poorly that you're not actually learning about your condition?
So if you MUST have answers that are at most random guesses, I'd suggest saving a few bucks and asking a coin before flipping it.
Then to say "Aha, but all of that is AI psychosis" makes obviously no sense: Why would we trust experts when they offer critique but not when they say "this is helpful"?
Overall: People are not insane. AI makes mistakes and, often, fails completely. AI also helps them do things better, quicker, increasingly so. The jaggedness of AI is confusing and real.
There is a huge difference between having a chance of a good result, which can be useful for experts able to filter out the bullshit, and consistent success. I would generate code as a helper, I would never allow a guy from marketing to merge unreviewed AI code.
For example, we had to advocate for certain practices during the birth of our first child that became routine during our second several years later.
So, neither side is guaranteed correct, doctor or citizen researcher (which did not include LLMs in my case, for the record). The truest answer is also the most useless one, applicable to all fields: it depends.
The real question is: if you embrace being a layman, whom do you trust more: LLMs/the internet or experts, like doctors? I think the answer is pretty clearly experts.
I.e. nothing this radiologist said was related to the LLM’s advice.
AI isn't even the first instance of this phenomenon, news articles are like this as well.
Welcome to the club? This new awareness you've found of the true quality of LLM based GenAI output has been what "all the haters" have been mad about for-ever. That the output of LLMs is clearly defective, and that they have merely found a cute trick towards making humans think they're less defective than they are actually measured to be.
And the corresponding anger and frustration to push the risks of genai output out onto others, while also aggressively pushing it as a feature you should be using already. You're behind don't you know, and whatever other lie I have to tell to trick you into enough FOMO to pay me 200USD/mo so I can sell FOSS back to you.
An LLM can only output the mean next likely token, and then add a bunch of extra noise on top of that so it feels interesting and not repetitive. None of this is new, the problem is, 50% of humans are below the mean, but have no idea. So when an LLM tells them some lie: well, it sounds so helpful! It's impossible for someone who sounds this helpful to lie to me, liars never sound confident! It must be PERFECT! I'm gonna tell everyone how perfect it is. So the bottom 0-33% think LLMs are fantastic tools that make nearly 0 mistakes in comparison to the bottom 33%. The 33-66%-ish aren't sure: sometimes it's great, but it will make that random mistake sometimes, but I can catch most (or all of them, depending on ego). And the 66%+ are angry about how many people are getting tricked by something so obviously low quality, or are lucky enough to not have to care.
Apply that to the Internet at large, and realize where LLMs got their training. They're basically ConfidentlyIncorrect personified.
Any comment that doesn't start with this or a similar qualification should be taken with a grain of salt (yes, including this one).
Medical imaging is one of those things everyone thinks is simple because they don't know what they don't know. I'm a cardiac sonographer, and I have to assume radiologists hear at least as many eye-rolling takes on AI coming for their job as I do.
Full sarcasm, is there one that’s more immune?
Edit: I should mention that ultrasound is basically unusable for evaluating bones. Sound waves can't penetrate bone, and so you end up just seeing a huge black void. That's a huge orthopedics use case that ultrasound just can't benefit. However, ultrasound is fantastic for evaluating muscles, ligaments, tendons, and other superficial soft tissues.
Since MRIs are more expensive, private doctors might order them instead of an ultrasound.
(I'm a doctor)
This really is key. We know we can't trust the AI, but at the same time we're also more comfortable asking the AI for clarifications or confronting it. Not having a time-bound appointment or paying by the hour helps a lot. But even then, more information doesn't necessarily help!
I once brought my 11-year-old car, a Civic with 150k miles, to multiple garages. I figured I'd play the "second opinion" game to correlate what the garages recommended to decide on what needed to be done...
I got 3 completely unrelated recommendations, including one that I knew was invalid. I felt worse off than when I started!
The solution to uncertain information isn't more information, which the AI can certainly provide, it's better information, and AI cannot currently provide that.
I also had a pretty painful shoulder issue at one point, where the pain just wasn't subsiding for months. I tried massages and acupuncture as I didn't want to do surgery, but it wasn't helping at all. The thing that fixed it for me was just really focusing on doing pull-ups. I couldn't do them at all when I started, so I began with dead hangs and scapular pull-ups, eventually progressing to regular pull-ups, and then training with a "grease-the-groove" method once I could get a few per set. I stopped the training schedule once I was getting in around 17 pull-ups per set, and now just do 6 sets of about 7-8 pullups 3x per week spaced throughout the day. I'll also do some shoulder mobility drills [1].
Whenever I get lazy about keeping up with them inevitably discomfort will start arising again, but it goes away once I get back to strengthening.
It really seems like if you, as a patient, go looking for a quick fix, that’s what you’ll be offered. And if you educate yourself a bit and then go for the best fix for you, you usually get that.
I wouldn't consider Claude itself to be the tool that does a job like this, but the tool that pulls in the best data and gives a supported suggestion. And then go through a number of iterations on where it failed to hone in its assessment.
On the plus side when they do this they can't flood your calendar with those "quick chat" meetings because they know they won't be able to hold a conversation on the issue beyond the first minute.
AI probably exacerbates it but crappy managers exist regardless
Dr. GPT is a good brainstorming tool. It helps synthesize information in a way that primary texts don’t. But it does force you to say “that doesn’t make sense”.
I do think that people saying “doctors don’t know the state of the art” have a weaker case. If you think about it in terms of token density during pretraining and how post training datasets are constructed, I think it would take us a very long time to adapt to any fundamental shifts. If we have forgotten how to cure scurvy, how many journal articles would it take before we adapt to a discovery?
Again, this is just one single person's experience. So not worth much.
I think we’ll see a lot of specialized VLMs that provide real value.
ChatGPT surfaced an NIH study that concluded that 20% of people have allergic reactions that are isolated to a body location, and that shoulder "skin prick" testing may not reveal. I asked him about that and he said "that's not how allergies work". Full stop. He was unwilling to even look at the study.
He prescribed a CPAP and regular nebulizer treatments. Side story: the CPAP place sent me an SMS message that I couldn't tell apart from a phishing attempt, and when I reached out to inquire who they were they never replied.
So I decided: Let me just try taking a second-gen allergy tablet every day and see what happens.
My sinus infections have gone away. Previously I was getting a major sinus infection at least quarterly. Maybe he's right that allergies don't work that way, but allergy tablets have absolutely solved my problem. Which I'm thankful for because I tried a CPAP for a solid month a few years ago and I just could not get used to it, and was sleeping like crap.
All I can find is about 1st gen antihistamines (i.e. Benadryl, which I doubt many people take daily, because of the drowsiness).
Even for those, evidence seems to be mixed at best. "Huge increases" seems like hyperbole.
Only first-generation antihistamines with anticholinergic effects are associated with cognitive decline in elderly patients.
Actually, I'm curious what ChatGPT 5.5's ELO is- I wouldn't be too surprised if it's 2000+ just from its basic understanding of chess principles from all the content it has digested.
Current Siemens MR software ‘Deep Resolve’ makes up the signal (adding about 50%), then makes up every second pixel, and then, for 3D sequences, makes up every second slice. It’s knocking about 59% of the time off each sequence. And it’s really really good. I’m an MR tech.
And yea, I already did all the standard things. CBT for insomnia helped somewhat. My insurance didn’t fully cover it either, unless I was willing to wait for 8 to 12 months.
And I recently met someone with slow moving metastatic cancer. Thanks to LLMs they will most likely live another 3 to 5 years extra since the Dutch conventional mainline treatment hasn’t been taken yet. But it is German doctors that helped them and Belgian doctors that pointed out in a second opinion that a lot more can be done.
LLMs have a part to play. The false positives are awful, but I have seen an average of 5 out of 10 care when things become too complicated.
Except for trauma treatment. The Dutch healthcare system is amazing once they diagnose classic PTSD.
So it’s definitely not all bad but the trust I had when I was younger has been eroded quite a bit and LLMs can meaningfully step in, in my case at least.
[1] I know there are worse systems. But from what I have heard there are clearly better systems nowadays. It has slipped a lot
So 3 days out of 7 days I have guaranteed good sleep. The other 4 days are a toss up. But an average of 5 days of good sleep is much better than 3.5 days out of 7 days.
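To spell out the arithmetic above, here is a minimal sketch. It assumes a "toss up" night has a 50% chance of being good, which is one reading of the comment and not something it states exactly; the function name is made up for illustration.

```python
def expected_good_nights(guaranteed: int, toss_ups: int, p_good: float = 0.5) -> float:
    """Expected good nights per week: certain nights plus coin-flip nights."""
    # p_good = 0.5 is an assumption standing in for "a toss up".
    return guaranteed + toss_ups * p_good


# 3 guaranteed good nights plus 4 toss-up nights.
with_help = expected_good_nights(3, 4)
# All 7 nights left to chance.
baseline = expected_good_nights(0, 7)

print(with_help, baseline)  # prints 5.0 3.5
```

Under that assumption the gain is 1.5 expected good nights per week, which matches the 5 versus 3.5 in the comment.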
I told my mechanic the flim flam is broken but he said it was the rim ram. He fixed it and we all went on with our lives.
But doctors insist on this God like status so it’s a “nightmare” when patients try to help themselves.
The same issues that were present with search-engine self diagnosis are still present with LLMs. If you provide Google with an incomplete list of symptoms and can’t interpret the information you find correctly, you will likely get an incorrect diagnosis. The same is true for LLM output.
Studies have found that newer reasoning AIs are about as good at diagnosing illness from a written description of symptoms as doctors are.
Granted, it cannot actually examine a patient, so we're not replacing doctors anytime soon. But your view is obsolete.
It may have some utility after diagnosis, but doesn’t demonstrate utility for patients.
The more training data, the more questions it can answer with a reasonable degree of probability of accuracy.
Throwing away a potentially useful analysis just because it’s probabilistic seems a bit like throwing the baby out with the bath water.
But AI's problem is that it's completely full of shit, sometimes, and the people most qualified to evaluate whether it's full of shit are the doctors, not the patients, but just like OP's original article, patients are left feeling like their second opinion from AI might be more trustworthy than their doctor's opinion.
Examples of things normal people can verify
- procedural errors that Claude can capture like some blatantly high dosage (grams instead of milligrams)
- outdated treatment plan, maybe there’s a credible new treatment plan that’s been used for years but the doctors were not updated
- literally being injected homeopathic drugs (takes no smart person to flag this)
Let’s stop talking as if doctors have a divine right here. And let’s accept some agency.
The dad was a retired neuroscientist who delayed cancer treatment against medical advice because he was certain he had been misdiagnosed based on his own research that he did with the help of A.I.
https://www.nytimes.com/2026/04/13/well/ai-chatbots-cancer.h...
There's a comment on the article from Ben Riley:
> I am very grateful to Teddy Rosenbluth for sharing my father's story with the world, her kindness and curiousity proved to be restorative in ways I didn't anticipate.
> The two words that everyone used to describe my dad: "intelligent" and "kind," and he was indeed both of those things. The sad irony here is that it was his human intelligence, combined with these strange new tools that purport to be a form of 'artificial' intelligence, that led to his ill-advised decision to forego the treatment he needed for his CLL. A doctor has already commented on this story with the observation that AI "confidently asserts erroneous conclusions," and we simply have no idea how often this is happening or the magnitude of the harm that results.
> Not a day goes by that I don't feel the pang of my father's absence. He might still be here if not for AI. I try not to think about that, but sometimes I can't help myself.
This is the real root issue.
At 75 years old, he was stubborn. Is that reasonable ? Yes, perfectly. Could he have been right since the beginning ? Certainly. Did he deny evidence ? Yes.
Zero doubt that he was intelligent, everything points toward that direction, but that doesn't make a person less stubborn, because accepting the evidence, is also accepting that you were wrong if you initially postured yourself as adversarial instead of cooperative.
He would have read Wikipedia, scientific papers, etc, even without AI.
He did not want to be convinced. It works both ways:
https://www.foxnews.com/health/woman-says-chatgpt-saved-her-...
or
https://www.today.com/health/mom-chatgpt-diagnosis-pain-rcna...
Nonetheless, someone very smart, just didn't want to move from his position.
Your comment is akin to saying "Karen from facebook who is a human pushed essential oils and ivermectin as a cure to cancer. Now doctor Y is suggesting chemo. Both are humans, humans cannot be trusted!"
The clanker said I'd be fine, I just needed some rest and OTC meds.
The medical staff immediately turfed me to surgery because the same set of symptoms I told the clanker were enough to concern them that I needed emergency surgery.
Had I listened to the clanker, I'd be dead because I did need emergency surgery. (Hell, I almost kicked the bucket because I waited for someone to wake up to give me a lift because my insurance probably doesn't cover an ambulance ride.)
We need studies that quantify error rates from each source type, then we need to account for the fact that the artificial type will keep improving.
Pretty much like most managers these days, so I understand the frustration of the GPs.
A con artist, a fraud
Like any domain, when you have questions or need a solution, you do research first, then you ask a specialist.
If you explain the symptoms and context well you can get proper advice and then decide on the next path:
Case A) It looks benign and the advice / information that you collected seems reasonable, then you go your way.
Case B) You need a second opinion from a specialist because the subject is too complex, or there are medications for which you need approval.
Once you have challenged LLMs, and read about the topics over and over, then you genuinely become really good at understanding them (especially if you triangulate across LLMs and ask them to challenge, you start to have genuine questions). Whether the answer is right or wrong, you have elements to work with. Maybe you missed the point, but you come prepared. At home you have the time to assess the options, the pros and cons of each approach, and the possible questions to ask, and then challenge the doctor.
Shared decision-making is an actual evidence-based model of care, and patients who arrive understanding their condition and carrying specific questions tend to get better attention and better outcomes.
Some doctors get annoyed, because they have big egos and choose to be patronizing, but it is exactly their job to answer such questions.
With LLMs, it's quite good, you get nuanced and rather useful answers.
Before LLMs, no matter the topic you searched for, the answer was the same: "you have cancer / an [obviously deadly] rare disease"
The other problem, in many places:
• The doctors are not affordable
• They are too busy for you (< 15 minutes)
• You may need to wait months to get an appointment
• They are not good (country-side is an example, and sometimes even country-level)
+ you can have all of these factors together.
So, you have something deeply bothering you, your only appointment is in 4 months. It would be insane not to take the time to explore different solutions and not to come informed about the topic.
If you express your prompt properly and do not rely on imagery, you can absolutely get top-tier advice.
Instead, it is my experience with LLMs in a domain that I know very well that makes me skeptical of their performance across the board. I find issues with their output in code review multiple times a day, and they are explicitly and extensively trained on this use case, unlike with the MRI data. Sometimes I veer into other domains I have decent knowledge about (construction, carpentry, landscaping) and LLMs disappoint me there as well.
I suppose Gell-Mann amnesia is a universal human quirk and not restricted to just the news.
One doctor diagnosis + LLM is gonna throw you off. You need more datapoints.
The LLM doesn’t need to be leading or whatever but then you can have a conversation with the patient. If their ChatGPT report has differences, those can be analyzed as well.
It feels like the time constraint of the 15m doctor sessions is the thing. But if prepared immediately after the scan then why not?
There is always time needed to factor in new developments and innovations and that’s fine. Just blindly moving work from human to LLM is wrong. But learning about and testing all the AI tools that are constantly arriving won’t be a waste. There will be more and more tools in those processes outside of human judgement; better to improve the workflows now so that new models and systems can be tested and plugged in when they are ready.
Because they don't exist, yet.
In the UK, MRIs and other imaging need two opinions. There has been a move to allow the first opinion to be ML based.
The _problem_ is that you are basically doing grey smudge analysis, and that's fucking hard.
> As detailed in a new, yet-to-be-peer-reviewed paper, a team of researchers at Stanford University found that frontier AI models readily generated “detailed image descriptions and elaborate reasoning traces, including pathology-biased clinical findings, for images never provided.”
> In other words, the AI models happily came up with answers to questions about a supposedly accompanying image — even if the researchers never even showed it an image.
> As opposed to hallucinations, which involve AI models arbitrarily filling in the gaps within a logical framework, the team coined a new term for the phenomenon: “mirage reasoning.”
> The effect “involves constructing a false epistemic frame, i.e., describing a multi-modal input never provided by the user and basing the rest of the conversation on that, therefore changing the context of the task at hand,” the researchers wrote in their paper.
> The damning findings suggest AI models cheat by diving into the data they were given — and coming up with the rest based on probability, even if it’s almost entirely conjecture.
I know you can’t trust an LLM’s self-assessed “confidence” of a prediction, but I’ve found that confidence can at least be directionally correct for some tasks. For our benchmarks, however, confidence was poorly correlated. What’s worse is that binary classification models (“Do you see $diagnosis in this photo?”) highly influenced the LLM to confidently predict $diagnosis.
I’m concerned for those using LLMs for diagnostics, and getting confidently led to the wrong conclusion.
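For anyone wanting to check whether self-reported confidence is even "directionally correct" on their own benchmark, one crude test is to compare mean confidence on correct versus incorrect predictions. The sketch below is not the benchmark described above (which isn't shown); the function name and the sample numbers are invented for illustration.

```python
def confidence_gap(results: list[tuple[float, bool]]) -> float:
    """Mean confidence on correct predictions minus mean confidence on wrong ones.

    Each item is (self_reported_confidence, was_correct). A clearly positive
    gap means confidence is at least directionally informative; a gap near
    zero or below means it is not.
    """
    right = [conf for conf, ok in results if ok]
    wrong = [conf for conf, ok in results if not ok]
    if not right or not wrong:
        raise ValueError("need at least one correct and one incorrect prediction")
    return sum(right) / len(right) - sum(wrong) / len(wrong)


# Invented sample: the model is more confident when it is wrong.
sample = [(0.95, True), (0.90, False), (0.60, True), (0.92, False)]
gap = confidence_gap(sample)
print(f"{gap:+.3f}")
```

A negative gap like the one in this made-up sample is the "confidently led to the wrong conclusion" case. A real evaluation would want proper calibration curves and far more data than this.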
What I’ve seen be the true bottleneck is people not setting up the structured data. But making a tiny reasoning model with OPSD -> GRPO is totally doable with a bit of money.
I wonder if the above problem can be fixed similarly? Just ask the LLM to do a conservative grounding analysis before jumping to the main task?
https://www.nature.com/articles/d41586-026-01947-1
I've started asking my doctors whether they use AI, and if they say yes look for another one.
A very plausible explanation for the adenoma detection rate to have gone down is simply that its prevalence went down among the population in the second three-month period.
This was not a randomized trial. Concluding that "AI usage degrades physicians' skills" is questionable at the very least.
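A toy illustration of the confound, with made-up numbers (none of these figures come from the study): if the observed detection rate is roughly prevalence times the endoscopist's sensitivity, the rate can fall between two periods even when skill is held perfectly constant.

```python
def observed_detection_rate(prevalence: float, sensitivity: float) -> float:
    """Fraction of procedures with a detected adenoma, under a simple
    prevalence * sensitivity model (an assumption, ignoring false positives)."""
    return prevalence * sensitivity


SENSITIVITY = 0.80  # hypothetical physician skill, identical in both periods

period_1 = observed_detection_rate(0.35, SENSITIVITY)  # hypothetical prevalence 35%
period_2 = observed_detection_rate(0.28, SENSITIVITY)  # hypothetical prevalence 28%

print(f"{period_1:.3f} -> {period_2:.3f}")
```

Here the detection rate drops from 28% to about 22% with no change in skill at all, which is why a before/after comparison without randomization can't separate the two explanations.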
And well, yes, I have the appropriate life science degrees to navigate clinical trial reports and research publications, and that was likely indispensable for steering Claude Code where it went, the radiologist's caution is merited here. But it's just not amateur hour for me to do this, it's 2 decades of academic research in my rearview mirror.
Many can get paid fee-for-service for after hours work, so would probably prefer that.
LLMs are the best PDF-to-markdown converters, in my experience. I have a CLI that converts PDF to PNG, then run a background agent to "read" each PNG and write it down as markdown; it works flawlessly even for complex math formulas, it can "translate" complex charts, graphs, and tables into words.
It's slow and arguably expensive compared to traditional OCR, but very effective and precise.
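Roughly, the shape of such a pipeline might look like the sketch below. This is not the commenter's actual CLI: `pages_to_markdown` and `fake_transcribe` are hypothetical names, the page images are assumed to exist already (something like poppler's `pdftoppm -png` can produce them), and the model call is left as a pluggable function because the agent being used isn't specified.

```python
from pathlib import Path
from typing import Callable, Iterable


def pages_to_markdown(
    page_images: Iterable[Path],
    transcribe: Callable[[Path], str],
) -> str:
    """Transcribe each page image in the order given and join the results.

    `transcribe` stands in for whatever vision-capable model or agent reads
    one PNG and returns markdown for it.
    """
    sections = []
    for number, image in enumerate(page_images, start=1):
        text = transcribe(image).strip()
        # Keep a page marker so a bad transcription can be traced to its source image.
        sections.append(f"<!-- page {number}: {image.name} -->\n{text}")
    return "\n\n".join(sections)


def fake_transcribe(image: Path) -> str:
    """Placeholder for the real model call; returns dummy markdown."""
    return f"# Transcript of {image.stem}\n"


pages = [Path("page-1.png"), Path("page-2.png")]
document = pages_to_markdown(pages, fake_transcribe)
print(document)
```

Keeping the per-page markers makes it easier to spot-check individual pages against the original, which matters given that the transcription step is a model's reading and not a deterministic OCR pass.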
If the author would actually go for a second opinion (maybe bring along the AI to let it explain its findings), then the article could read as "AI did MRI analysis and proved my doctor wrong" (or: "AI did MRI analysis and failed").
An AI telling you it could be X or Y because theory ABC… is the academic answer and a luxury clinicians don’t have. AI doesn’t give you what you want. I don’t see any added value in using generic AI models for this
I found that while Claude, GPT etc could describe an image, there was no way to link the description back to specific pixels in the image itself. Not even to a bounding box or segment.
Even a tiny injury can severely cripple us.
It's not true that "AI makes mistakes" or "ChatGPT is sycophantic". It's just that sometimes the simulated extensions to the training material are accurate, and sometimes they're not.
IME, on an almost daily basis, claude.ai and Claude Code are confidently wrong about something, and use polished language to assert nonsense.[*]
If it's doing that on something easy, like factual knowledge available in text on the Internet, or programming code that can be inspected easily and follows well-known rules, and I can tell, because I understand those things... then there's no way I'm going to assume that Claude doesn't also BS when it comes to someone else's field. Especially not a field that requires some of the smartest people to go through a decade of training, just to get started in the field.
[*] And if I confront Claude with its mistakes, eventually it apologizes, and acts as if it's learned something, again mimicking word patterns it's heard real people use and mean, without meaning any of it. I wonder whether the AI user experience would be better, if LLM-ish interfaces weren't implicitly created in the image of fake-it-till-you-make-it overconfident performative sociopathic techbros.
But are you all forgetting that they literally injected a homeopathic drug on the author?
Between that and Claude sometimes hallucinating, it’s probably worth encouraging patients to always get a second opinion.