Hallucinations by Design (Part 2): The Silent Flaws of Embeddings & Why Your AI Is Getting It Wrong

by Ritesh Modi, April 1st, 2025

Too Long; Didn't Read

This is the second part in the series on Hallucinations by Design, continuing our previous discussion of how embeddings hallucinate. We're basically working with models that can't tell the difference between speculation and confirmation.

Caption: The two characters look different but share a striking similarity in posture, expression, and background—almost like they are "embeddings" of different sentences that end up close together.


Read Part 1 here: https://hackernoon.com/hallucination-by-design-how-embedding-models-misunderstand-language


Last month, I shared how embedding models hallucinate when handling simple language variations like negation and capitalization. The response was overwhelming – seems I'm not the only one who's been burnt by these issues. Today, I'm diving deeper into even more troubling blind spots I've discovered through testing. These are the kinds that keep me up at night and make me question everything about how we're building AI systems.


This is the second part in the series on Hallucinations by Design, continuing our previous discussion of how embeddings hallucinate. To get the most out of this article, I highly recommend reading Part 1 first, as it lays the foundational concepts needed to fully grasp the ideas explored here.

Hypothetical vs. actual? Just details!

Here's where things get truly disturbing. When I ran "If the treatment works, symptoms should improve" against "The treatment works and symptoms have improved", the similarity score hit 0.95. I sat staring at my screen in disbelief. One's speculating about potential outcomes; the other's reporting confirmed results!
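For readers who want to sanity-check numbers like this themselves: a similarity score is just the cosine of the angle between the two sentences' embedding vectors. Here is a minimal sketch of the computation; the three-dimensional vectors are toy stand-ins, since real embeddings (from a model like all-mpnet-base-v2) have hundreds of dimensions.

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" -- in practice these come from an embedding model.
hypothetical = [0.8, 0.5, 0.1]  # "If the treatment works, symptoms should improve"
actual       = [0.8, 0.5, 0.2]  # "The treatment works and symptoms have improved"

# Nearly parallel vectors -> similarity close to 1, exactly the failure
# mode described above.
print(round(cosine_similarity(hypothetical, actual), 3))
```

The point of the toy vectors is that two embeddings only need to point in nearly the same direction to score above 0.9, regardless of how consequential the wording difference is.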


I hit this problem while working with clinical research documents. The search couldn't distinguish between hypothesized treatment outcomes and verified results. Doctors searching for proven treatments were getting results mixed with unproven hypotheses. Do you think physicians making treatment decisions appreciate confusing speculation with evidence? I am sure I wouldn't want my medical care based on "might work" rather than "does work".


Again, think about all the cases where distinguishing hypotheticals from facts is essential - scientific research, medical trials, legal precedents, and investment analyses. When your model conflates "if X then possibly Y" with "X happened and caused Y", you've completely misunderstood the epistemic status of the information. We're basically working with models that can't tell the difference between speculation and confirmation despite analyzing text where this distinction determines whether something is reliable information or mere conjecture.
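One pragmatic mitigation is a shallow linguistic check layered on top of embedding retrieval: flag sentences that look speculative so they can be ranked or labeled separately from confirmed findings. A minimal sketch follows; the marker list is illustrative, not exhaustive.

```python
import re

# Illustrative markers of hypothetical / conditional language.
SPECULATIVE_MARKERS = re.compile(
    r"\b(if|should|might|may|could|would|possibly|hypothes\w*)\b",
    re.IGNORECASE,
)

def looks_speculative(sentence: str) -> bool:
    """Heuristic: does the sentence contain hedging or conditional cues?"""
    return bool(SPECULATIVE_MARKERS.search(sentence))

print(looks_speculative("If the treatment works, symptoms should improve"))  # True
print(looks_speculative("The treatment works and symptoms have improved"))   # False
```

A check this crude obviously misfires on some sentences, but as a flag attached to search results it lets a doctor see at a glance which hits are speculation.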

Temporal order? Whatever order!

Embedding models see "She completed her degree before starting the job" and "She started her job before completing her degree" as NEARLY identical – ridiculous 0.97 similarity score. One's a traditional career path; the other's working while studying. Completely different situations!


I found this while building a resume screening system. The embeddings couldn't distinguish between candidates who finished their degrees before working and those who were still completing studies. Hiring managers wasted hours interviewing candidates who didn't meet their basic qualification requirements. Do you think busy recruiters appreciate having their time wasted with mismatched candidates? I am sure I wouldn't want my hiring pipeline filled with noise.


Think about all the cases where sequence is crucial – medical treatment protocols, legal procedural requirements, cooking recipes, assembly instructions, and chemical formulations. When your model can't tell "A before B" from "B before A," you've lost fundamental causal relationships. We're basically working with models that treat time as an optional concept despite analyzing text that's full of critical sequential information.
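If ordering matters in your domain, one option is to extract it explicitly rather than trusting the embedding. Here is a deliberately naive sketch that pulls the ordered pair of events out of an "X before Y" sentence, so two sentences with opposite orderings stop looking identical downstream:

```python
import re

def extract_order(sentence: str):
    """Return (first_event, second_event) from a '... before ...' sentence,
    or None if no ordering cue is found."""
    match = re.search(r"(.+?)\s+before\s+(.+)", sentence, re.IGNORECASE)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return None

a = extract_order("She completed her degree before starting the job")
b = extract_order("She started her job before completing her degree")
print(a != b)  # True: the two sentences encode different orderings
```

A production system would need to handle "after", "while", tense cues, and so on, but even this single pattern catches the resume-screening failure described above.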

Quantitative thresholds vanish into thin air

This one actually made me spill my coffee. Embedding models see "The company barely exceeded earnings expectations" and "The company significantly missed earnings expectations" as SHOCKINGLY similar – 0.93 similarity score. Exceeded versus missed! These mean opposite things in finance!


If you are building a financial news analysis system, the embeddings won't distinguish between positive and negative earnings surprises – literally the difference between a stock price going up or down. Investors making trading decisions based on such summaries would be getting completely contradictory information. Do you think people risking actual money appreciate getting fundamentally wrong market signals? I am sure I wouldn't want my retirement account guided by such confusion.


Now, think about all the cases where crossing a threshold changes everything – passing vs. failing grades, healthy vs. dangerous vital signs, profitable vs. unprofitable businesses, compliant vs. non-compliant regulatory statuses. Your model loses its ability to make meaningful distinctions when it cannot distinguish between barely meeting the target and completely missing it. We're basically working with models that don't understand the concept of thresholds despite analyzing text that's constantly discussing whether targets were met or missed.
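A cheap guardrail for the earnings case is an explicit polarity check on the verb, applied before results are surfaced. The cue lists below are illustrative examples, not a complete financial vocabulary:

```python
# Illustrative polarity cues for earnings statements.
POSITIVE_CUES = {"exceeded", "beat", "surpassed", "topped"}
NEGATIVE_CUES = {"missed", "fell short", "underperformed"}

def earnings_polarity(text: str) -> str:
    """Classify an earnings statement as 'positive', 'negative', or 'unknown'."""
    lowered = text.lower()
    if any(cue in lowered for cue in POSITIVE_CUES):
        return "positive"
    if any(cue in lowered for cue in NEGATIVE_CUES):
        return "negative"
    return "unknown"

print(earnings_polarity("The company barely exceeded earnings expectations"))      # positive
print(earnings_polarity("The company significantly missed earnings expectations")) # negative
```

Attaching this label to each retrieved document means a similarity score of 0.93 between "exceeded" and "missed" no longer silently mixes the two.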

Scalar inversions get completely flipped

The absurdity just keeps piling up. During testing, I found that "The meeting ran significantly shorter than planned" and "The meeting ran significantly longer than planned" scored a 0.96 similarity. I was in complete shock. These sentences describe completely opposite situations – time saved versus time wasted!


I encountered this with project management documents. The search couldn't distinguish between schedule overruns and efficiencies. Managers searching for examples of time-saving techniques were getting shown projects with serious delays. Do you think executives tracking project timelines appreciate getting the exact opposite information they asked for? I am sure I would be furious if I were preparing for a board meeting with such backward data.


Think about all the cases where direction on a scale is crucial – cost savings vs. overruns, performance improvements vs. degradations, health improvements vs. declines, and risk increases vs. decreases. When your model treats "much higher than" as interchangeable with "much lower than", you've lost the ability to track directional change. We're basically working with models that don't understand opposing directions despite analyzing text filled with comparative assessments.
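Directional flips can be caught the same way: compare the scale words on each side and flag candidate matches that use the query's antonym. A minimal sketch with an illustrative antonym table:

```python
# Illustrative antonym pairs for scalar direction words.
ANTONYMS = {
    "longer": "shorter", "shorter": "longer",
    "higher": "lower", "lower": "higher",
    "more": "less", "less": "more",
}

def directionally_inverted(query: str, candidate: str) -> bool:
    """Flag a candidate that uses the antonym of a direction word in the query."""
    query_words = set(query.lower().split())
    candidate_words = set(candidate.lower().split())
    return any(
        word in query_words and ANTONYMS[word] in candidate_words
        for word in ANTONYMS
    )

print(directionally_inverted(
    "The meeting ran significantly shorter than planned",
    "The meeting ran significantly longer than planned",
))  # True
```

In the project-management scenario above, this one check would have kept schedule overruns out of a search for time savings.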

Domain-specific opposites look like synonyms

Medical documents

I couldn't believe what I was seeing in the healthcare tests. "The patient presents with tachycardia" versus "The patient presents with bradycardia" returned a 0.94 similarity score. For non-medical folks, that's like confusing a racing heart with one that's dangerously slow – conditions with opposite treatments!


I discovered this while working on a symptom-matching system for electronic health records. The embedding model couldn't distinguish between fundamentally different medical conditions that require opposite treatments. Physicians searching for cases similar to a patient with a racing heart were shown cases of patients with dangerously slow heartbeats. Do you think doctors making time-sensitive decisions appreciate getting contradictory clinical information? I am sure I wouldn't want my treatment based on the opposite of my actual condition.


In the field of medicine, these distinctions can have significant consequences. Tachycardia might be treated with beta-blockers, while bradycardia might require a pacemaker – giving the wrong treatment could be fatal. We're basically working with models that can't distinguish between opposite medical conditions despite analyzing text where this distinction determines appropriate care.
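In domain-specific systems, a small curated table of clinical opposites can veto matches that embeddings rank as near-identical. A sketch of the idea – the term pairs listed are illustrative examples, not a complete medical vocabulary:

```python
# Illustrative pairs of clinically opposite terms.
CLINICAL_OPPOSITES = [
    ("tachycardia", "bradycardia"),
    ("hypertension", "hypotension"),
    ("hyperglycemia", "hypoglycemia"),
]

def clinically_contradictory(text_a: str, text_b: str) -> bool:
    """True if the two texts mention opposite members of a known clinical pair."""
    a, b = text_a.lower(), text_b.lower()
    return any(
        (x in a and y in b) or (y in a and x in b)
        for x, y in CLINICAL_OPPOSITES
    )

print(clinically_contradictory(
    "The patient presents with tachycardia",
    "The patient presents with bradycardia",
))  # True
```

Curating such a table is tedious, but in a symptom-matching system it is far cheaper than letting a 0.94 similarity score surface the opposite condition.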

The legal tests were just as bad. When comparing "Plaintiff bears the burden of proof" with "Defendant bears the burden of proof", the model returned a staggering 0.97 similarity. Let that sink in. These statements literally determine which side has to prove their case in court! Mixing these up could lose you your lawsuit.


The search couldn't distinguish between fundamentally different legal standards and responsibilities. Lawyers researching precedents about plaintiff burdens were shown cases discussing defendant burdens. Do you think attorneys preparing for trial appreciate getting precisely backward legal standards? I am sure I wouldn't want my lawsuit built on completely inverted legal principles.


In legal contexts, who bears the burden of proof often determines the outcome of a case. When your model can't distinguish which party has which responsibilities, you've undermined the entire basis of legal reasoning. We're basically working with models that confuse legal roles despite analyzing text where these distinctions define how justice functions.

Units of measurement

I had to run this test multiple times because I couldn't believe the results. "The procedure takes about 5 minutes" versus "The procedure takes about 5 hours" scored a whopping 0.97 similarity. Is this for real? That's a 60x time difference! Imagine waiting for your "5-minute" appointment that actually takes 5 hours.


I found this while building the same healthcare system. The embeddings couldn't distinguish between brief and lengthy procedures. Clinic managers trying to schedule short procedures were being shown lengthy operations that would block their surgery suites for entire days. Do you think medical facilities with tight scheduling constraints appreciate having their entire day's workflow disrupted? I am sure I wouldn't want my hospital running 60x behind schedule.


Units of measurement fundamentally change meaning. When your model treats "5 minutes" and "5 hours" as essentially identical, you've lost the ability to understand magnitude. We're basically working with models that ignore units despite analyzing text where units determine whether something is trivial or significant.

More measurement problems

And it just gets worse from there. While working with the same healthcare documents, I found that "The tumor is 2 centimeters in diameter" and "The tumor is 2 inches in diameter" scored an alarming 0.98 similarity. For context, that's the difference between a potentially minor tumor and one that's 2.54x larger – often the threshold between "watch and wait" versus immediate surgery.


The embeddings couldn't distinguish between metric and imperial measurements. Oncologists researching treatment options for small tumors were being shown cases of much larger growths. Do you think cancer specialists appreciate getting case studies that aren't remotely comparable to their patients?


Even speed limits get confused. Models treat "Maintain speeds under 30 mph" and "Maintain speeds under 30 kph" as HIGHLY similar – a problematic 0.96 similarity score. That's the difference between 30 and 18.6 miles per hour – enough to determine whether an accident is fatal!


Converting between units isn't just a mathematical exercise – it fundamentally changes recommendations, safety parameters, and outcomes. We're basically working with models that think numbers without units are sufficient despite analyzing text where the units completely transform the meaning.
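The unit failures are the easiest to guard against, because units can be normalized deterministically before (or alongside) embedding comparison. Here is a sketch that converts simple quantity phrases into canonical values (seconds for durations, meters for lengths); the conversion table is a small illustrative subset:

```python
import re

# Illustrative conversion factors to canonical units (seconds, meters).
UNIT_FACTORS = {
    "minute": 60.0, "minutes": 60.0,
    "hour": 3600.0, "hours": 3600.0,
    "centimeter": 0.01, "centimeters": 0.01,
    "inch": 0.0254, "inches": 0.0254,
}

def normalize_quantity(text: str):
    """Return the first recognized quantity in the text, converted to
    canonical units, or None if none is found."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*([a-zA-Z]+)", text)
    if match and match.group(2).lower() in UNIT_FACTORS:
        return float(match.group(1)) * UNIT_FACTORS[match.group(2).lower()]
    return None

print(normalize_quantity("The procedure takes about 5 minutes"))  # 300.0
print(normalize_quantity("The procedure takes about 5 hours"))    # 18000.0
```

Once both texts are reduced to canonical numbers, the 60x gap between "5 minutes" and "5 hours" is trivially detectable, no matter what the embedding says.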

The Truth and the Results

Here is the comparison between msmarco-distilbert-base-tas-b, all-mpnet-base-v2, and OpenAI's text-embedding-3-large; you will notice there is no significant difference in their outputs.


***msmarco-distilbert-base-tas-b embedding scores across the test cases***

***all-mpnet-base-v2 embedding scores across the test cases***

***text-embedding-3-large embedding scores across the test cases***

Just to repeat…

Look, embeddings are amazingly useful despite these problems. I'm not advocating against using them, but rather, it's crucial to approach them cautiously. Here's my battle-tested advice after dozens of projects and countless failures:


  1. Test your model on real user language patterns before deployment. Not academic benchmarks, not sanitized test cases – actual examples of how your users communicate. We built a "linguistic stress test" toolkit that simulates common variations like negations, typos, and numerical differences. Every system we test fails in some areas – the question is whether those areas matter for your specific application.


  2. Build guardrails around critical blind spots. Different applications have different can't-fail requirements. For healthcare, it's typically negation and entity precision. For finance, it's numbers and temporal relationships. For legal, it's conditions and obligations. Identify what absolutely can't go wrong in your domain, and implement specialized safeguards.


  3. Layer different techniques instead of betting everything on embeddings. Our most successful systems combine embedding-based retrieval with keyword verification, explicit rule checks, and specialized classifiers for critical distinctions. This redundancy isn't inefficient; it's essential.

  4. Be transparent with users about what the system can and can't do reliably. We added confidence scores that explicitly flag when a result might involve negation, numerical comparison, or other potential weak points. Users appreciate the honesty, and it builds trust in the system overall.
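The stress-test idea in point 1 (and the flagging in point 4) can be sketched as a tiny harness: feed known-tricky sentence pairs to whatever similarity function you use and report which categories score suspiciously high. The pairs and the stand-in similarity function below are illustrative placeholders for real model calls:

```python
# Known-tricky pairs: (category, sentence_a, sentence_b). The sentences in
# each pair mean different things, so a trustworthy model should NOT score
# them as near-identical.
STRESS_PAIRS = [
    ("negation", "The drug is safe", "The drug is not safe"),
    ("units", "The procedure takes 5 minutes", "The procedure takes 5 hours"),
    ("temporal", "A happened before B", "B happened before A"),
]

def run_stress_test(similarity_fn, threshold: float = 0.9):
    """Return the categories where the model scores a contrastive pair
    above the threshold -- i.e., its likely blind spots."""
    return [
        category
        for category, a, b in STRESS_PAIRS
        if similarity_fn(a, b) > threshold
    ]

# Stand-in for a real embedding model: pretends every pair is near-identical,
# mimicking the failure mode this article describes.
def dummy_similarity(a: str, b: str) -> float:
    return 0.95

print(run_stress_test(dummy_similarity))  # ['negation', 'units', 'temporal']
```

In practice you would plug in your real model's similarity function and grow the pair list with examples from your own domain; the categories that come back are exactly the blind spots that need guardrails.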


**Here's the most important thing I've learnt:** these models don't understand language the way humans do – they understand statistical patterns. When I stopped expecting human-like understanding and started treating them as sophisticated pattern-matching tools with specific blind spots, my systems got better. Much better.

The blind spots I've described aren't going away anytime soon – they're baked into how these models work. But if you know they're there, you can design around them. And sometimes, acknowledging a limitation is the first step toward overcoming it.


Note: I have many more such cases found through experiments, and I will cover them in my subsequent post.

The next article in the series is coming out soon. Stay tuned!