Hallucinations by Design: Part 4 - Fine-tuning Your Way Out of Vector Nightmares

Written by riteshmodi | Published 2025/04/21
Tech Story Tags: large-language-models | embeddings | fine-tuning | python | hallucinations-by-design | rag-systems | building-a-rag-system | mpnet

TL;DR: Fine-tuning embedding models is the key to fixing their inherent hallucination problems. This guide walks through the complete process: setting up your environment, generating targeted training data for specific issues (negation, numeric differences, capitalization, etc.), selecting the right base model, configuring hyperparameters, training efficiently, and evaluating results against baseline models. By applying these techniques, you can transform generic embeddings into domain-specific ones that accurately understand nuances in your data, dramatically improving retrieval accuracy, semantic search, and other vector-based applications.

The code for all the parts in this series is available on GitHub.

I've spent the last three articles exposing the uncomfortable truth about embedding models - they hallucinate by design. We've seen how these models misunderstand language, contain silent flaws, and why blindly trusting vectors without testing them leads to disaster.

Today, I want to shift gears. Let's talk solutions.

For optimal comprehension, I strongly suggest reading the previous articles first (PART-1, PART-2 and PART-3) to establish the background needed to fully appreciate the concepts we'll discuss here. Reading them in order will give you a more coherent understanding of these issues.

If you've been following this series, you understand the problems plaguing embedding models like MPNet, MS MARCO, and various OpenAI offerings. Their limitations aren't just academic concerns - they're practical roadblocks that undermine your AI applications every day.

Fine-tuning these embedding models represents one of our most promising paths forward. It's not a silver bullet, but it's a powerful approach that can dramatically reduce hallucinations and improve semantic understanding.

In this final installment, I'll walk through practical strategies for fine-tuning embedding models. We'll explore techniques that transform generic, hallucination-prone embeddings into domain-specific tools that actually understand your data. Think of it as teaching a model to speak your language instead of hoping it guesses what you mean.

Whether you're building RAG systems, semantic search, or any application that needs to understand meaning beyond keywords, this article aims to help you move from diagnosing problems to implementing solutions.

Today I'll walk you through creating a custom embedding model that actually understands your domain. We'll:

  • Set up the environment

  • Prepare training, validation, and test data for fine-tuning

  • Pick a base model to fine-tune from HuggingFace

  • Configure hyperparameters for fine-tuning the model

  • Train the model and save it

  • Evaluate the model and compare it with the base model

We'll use the sentence-transformers package from HuggingFace for this work. This process transforms generic embeddings into tools that grasp your specific terminology and relationships.

SentenceTransformers is a powerful Python framework built on top of HuggingFace's Transformers library that makes creating and fine-tuning embedding models surprisingly straightforward. It ships with pre-trained models that have been optimized for semantic similarity tasks, information retrieval, and clustering. The framework's real strength lies in how it simplifies the fine-tuning process with built-in support for various loss functions (CosineSimilarityLoss, TripletLoss, etc.) and training objectives tailored to different use cases.

For embedding model fine-tuning, it's the perfect balance of flexibility and ease-of-use - you get access to state-of-the-art architectures without having to manage the underlying complexity of transformer models. This makes it ideal for solving the hallucination problems we've been discussing throughout this series.
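To get a feel for the API before we set anything up, here's a minimal sketch: load a pre-trained model, encode two sentences, and compare them with cosine similarity. The model name is just one of the candidates discussed later, not a fixed choice.

from sentence_transformers import SentenceTransformer, util

# Load a pre-trained embedding model (downloaded from HuggingFace on first use)
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Encode a pair of sentences into dense vectors
embeddings = model.encode([
    "The patient has hypertension",
    "The patient has high blood pressure",
])

# Cosine similarity between the two embeddings
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(f"Cosine similarity: {similarity.item():.4f}")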

šŸ”§ Setting Up Your Environment

Setting up a dedicated environment isn't just good practice; it's essential for reproducible fine-tuning. This step is crucial but often overlooked, so before we dive into fine-tuning embedding models, let's get a clean, reproducible setup in place.

# Option 1: Using virtualenv
python -m venv halluc-env
source halluc-env/bin/activate  # On Windows: halluc-env\Scripts\activate
pip install -r requirements.txt

# Option 2: Using conda
conda create -n halluc-env python=3.10
conda activate halluc-env
pip install -r requirements.txt

# Option 3: Using uv (Ultra fast package installer)
uv venv halluc-env
source halluc-env/bin/activate
uv pip install -r requirements.txt

The environment depends on the following Python packages (requirements.txt):

torch==2.6.0
sentence-transformers==4.1.0
datasets==3.5.0
openai==1.75.0
python-dotenv==1.1.0
transformers[torch]==4.51.3
matplotlib==3.10.1
seaborn==0.13.2

šŸ—ƒļø Preparing Training Data

Fine-tuning embedding models requires high-quality, domain-specific data. When foundation models hallucinate, it's often because they don't understand the nuances of your specific domain. Let's solve that.

Your data format depends on the loss function you use during fine-tuning. Let's explore two popular options:

Option 1: CosineSimilarityLoss with EmbeddingSimilarityEvaluator

This approach requires data in a paired format with a similarity score:

import pandas as pd

# Required data format
data_format = {
    'sentence1': ['The patient has hypertension', 'Take medicine before meals'],
    'sentence2': ['The patient has high blood pressure', 'Take meals before medicine'],
    'similarity': [0.9, 0.2]  # Scores between 0 and 1
}
df = pd.DataFrame(data_format)

Converting to the format expected by sentence-transformers:

from sentence_transformers import InputExample
from torch.utils.data import DataLoader

# Convert data to InputExamples
train_examples = [
    InputExample(texts=[row['sentence1'], row['sentence2']], label=row['similarity']) 
    for _, row in df.iterrows()
]

# Create data loader
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

Option 2: TripletLoss

For TripletLoss, you need anchor-positive-negative triplets:

# Required data format for triplets
triplet_format = {
    'anchor': ['The patient has hypertension', 'Take medicine before meals'],
    'positive': ['The patient has high blood pressure', 'Take medication prior to eating'],
    'negative': ['The patient has low blood pressure', 'Take medicine after meals']
}
triplet_df = pd.DataFrame(triplet_format)

# Convert to InputExamples for TripletLoss
triplet_examples = [
    InputExample(texts=[row['anchor'], row['positive'], row['negative']])
    for _, row in triplet_df.iterrows()
]

# Create triplet dataloader
triplet_dataloader = DataLoader(triplet_examples, shuffle=True, batch_size=16)

Why Data Format Matters

The data format directly impacts what your model learns:

  1. CosineSimilarityLoss teaches your model to map sentences with similar meanings to vectors with high cosine similarity. The evaluator measures how well your model has learned this mapping.
  2. TripletLoss teaches your model that the anchor should be closer to the positive example than to the negative example. This is particularly useful for retrieval tasks. A minimal setup for both losses is sketched below.
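Here's a minimal sketch of constructing each loss with sentence-transformers, assuming the train_dataloader and triplet_dataloader built above:

from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Option 1: sentence pairs with similarity scores
cosine_loss = losses.CosineSimilarityLoss(model)

# Option 2: anchor/positive/negative triplets
triplet_loss = losses.TripletLoss(model)

# Each loss is paired with its dataloader when calling model.fit(), e.g.:
# model.fit(train_objectives=[(train_dataloader, cosine_loss)], epochs=1)
# model.fit(train_objectives=[(triplet_dataloader, triplet_loss)], epochs=1)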

The Importance of Proper Data Splits

Each data split serves a critical purpose:

  1. Training data (80%) is what your model learns from. This should cover all the nuances and edge cases you want your model to understand.

  2. Validation data (10%) helps you tune hyperparameters and avoid overfitting. During training, you'll regularly evaluate your model on this data to see if it's generalizing well.

  3. Test data (10%) provides an unbiased evaluation of your final model. You should only use this data once, after training is complete. It tells you how well your model will perform on unseen data.

When fine-tuning embedding models, proper data splitting is crucial because:

  • It prevents overfitting to training quirks

  • It ensures your model generalizes to new examples

  • It provides honest metrics about model performance

# Standard split ratios
train_ratio = 0.8
val_ratio = 0.1
test_ratio = 0.1

# Split data
from sklearn.model_selection import train_test_split

# First split: separate test set
train_val_df, test_df = train_test_split(df, test_size=test_ratio, random_state=42)

# Second split: separate validation set from training set
train_df, val_df = train_test_split(train_val_df, test_size=val_ratio/(train_ratio+val_ratio), random_state=42)

print(f"Training examples: {len(train_df)}")
print(f"Validation examples: {len(val_df)}")
print(f"Test examples: {len(test_df)}")

The repository for this article provides all three splits in CSV format.

Generating Synthetic Training Data

While real-world domain data is ideal, we can generate synthetic data using LLMs to target specific weaknesses in embedding models. Here's a prompt to help generate data that addresses common embedding failures:

prompt = """
    Generate {num_examples} training examples for fine-tuning an embedding model. 
    Each example should have two sentences and a similarity score between 0 (completely different) and 1 (identical).
    
    Focus on these challenging patterns:
    1. Negation (e.g., "The medicine contains aspirin" vs. "The medicine does not contain aspirin" → similarity: 0.1)
    2. Capitalization of domain-specific terms (e.g., "she visited paris" vs. "She has Paris syndrome" → similarity: 0.3)
    3. Numeric magnitude differences (e.g., "administer 5mg dosage" vs. "administer 50mg dosage" → similarity: 0.4)
    4. Temporal ordering (e.g., "Take medicine before meals" vs. "Take meals before medicine" → similarity: 0.2)
    5. Domain-specific synonyms (e.g., "patient exhibits hypertension" vs. "patient has high blood pressure" → similarity: 0.9)
    
    Return as a JSON array with fields: sentence1, sentence2, similarity
    """

šŸ” Selecting a Base Model for Fine-tuning

Choosing the right starting model is critical. This decision impacts everything from training speed to final performance. Let's explore some strong candidates from HuggingFace and how to evaluate them for your specific use case.

Key Selection Criteria

  1. Domain alignment: Choose a base model that's conceptually close to your target domain. For medical text, clinical BERT variants may perform better than general models.
  2. Size vs. performance trade-off: Larger models generally perform better but require more compute resources for fine-tuning and deployment.
  3. Inference speed requirements: If you need real-time embeddings in production, a smaller model might be preferable despite slightly lower quality.
  4. Training stability: Some models fine-tune more reliably than others. Models from the sentence-transformers library are specifically designed for fine-tuning.
  5. Community support: Models with active maintenance and large user bases tend to have better documentation and fewer unexpected behaviors.

Here's a selection of embedding models that work well as starting points:

| Model | Size | Strengths | Best for |
|---|---|---|---|
| sentence-transformers/all-MiniLM-L6-v2 | 80MB | Fast, compact, good general performance | Resource-constrained environments, mobile applications |
| sentence-transformers/all-mpnet-base-v2 | 420MB | Excellent general performance, handles longer text | General-purpose embeddings with a good quality-speed tradeoff |
| sentence-transformers/multi-qa-mpnet-base-dot-v1 | 420MB | Optimized for retrieval, handles questions and answers | RAG systems, Q&A applications |
| intfloat/e5-large-v2 | 1.3GB | State-of-the-art performance, rich semantic understanding | When quality is the top priority |
| BAAI/bge-large-en-v1.5 | 1.3GB | Strong on retrieval benchmarks (English-only; BAAI publishes separate Chinese variants) | Search and retrieval systems where quality matters |

Once you've selected a model, loading it is straightforward:

import torch
from sentence_transformers import SentenceTransformer

# Replace with your chosen model
base_model = 'sentence-transformers/all-mpnet-base-v2'
model = SentenceTransformer(base_model)

# Optional: Move to GPU if available
if torch.cuda.is_available():
    model = model.to(torch.device('cuda'))

print(f"Model loaded with embedding dimension: {model.get_sentence_embedding_dimension()}")

āš™ļø Configuration and Hyperparameters

Fine-tuning embedding models requires careful configuration of hyperparameters to maximize performance while preventing overfitting. Let's dive into the key settings that can make or break your fine-tuning process.

Some of the important hyperparameters and configuration options:

  • train_objectives: Pairs your training data with the loss function that guides the learning process.
  • evaluator: The component that measures model performance on validation data during training.
  • epochs: Total number of complete passes through the training dataset.
  • warmup_steps: Number of steps to gradually increase the learning rate, helping stability.
  • optimizer_params: Custom configuration for the optimizer, like learning rate and weight decay.
  • scheduler: Controls how learning rate changes during training (e.g., 'WarmupLinear').
  • output_path: Where to save your final fine-tuned model.
  • evaluation_steps: How often to evaluate model performance (here, twice per epoch).
  • save_best_model: Only keeps the version with the best validation score.
  • use_amp: Enables mixed precision training to speed up training on compatible GPUs.
  • checkpoint_path: Directory for saving intermediate model versions during training.
  • checkpoint_save_steps: How often to save checkpoints (here, once per epoch).
  • checkpoint_save_total_limit: Maximum number of checkpoint files to keep (prevents disk filling).
  • show_progress_bar: Displays visual training progress in your console.

These parameters come together in the call to model.fit() that fine-tunes the model:

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    optimizer_params=optimizer_params,
    scheduler=scheduler,
    output_path=str(output_path),
    evaluation_steps=len(train_dataloader) // 2,
    save_best_model=True,
    use_amp=True,  # Use mixed precision training if GPU supports it
    checkpoint_path=str(output_path / "checkpoints"),  # Save checkpoints during training
    checkpoint_save_steps=len(train_dataloader),  # Save every epoch
    checkpoint_save_total_limit=3,  # Keep only the 3 most recent checkpoints
    show_progress_bar=True
)
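The call above assumes that num_epochs, warmup_steps, optimizer_params, scheduler, and output_path have already been defined. Here's a minimal, illustrative configuration; the values are reasonable starting points rather than settings mandated by the repository:

from pathlib import Path

num_epochs = 4
warmup_ratio = 0.1  # fraction of total training steps used for learning-rate warmup
warmup_steps = int(len(train_dataloader) * num_epochs * warmup_ratio)

optimizer_params = {"lr": 2e-5}  # learning rate for the optimizer (AdamW by default)
scheduler = "WarmupLinear"       # linear warmup followed by linear decay

output_path = Path("fine-tuned-semantic-model")
output_path.mkdir(parents=True, exist_ok=True)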

šŸš€ Training the Model

Now that we've prepared our data and configured our hyperparameters, it's time to bring everything together and train our embedding model. This is where the magic happens - transforming a generic embedding model into one that understands your specific domain and avoids hallucinations.

    # Excerpt from the training script (full version in the repo). It assumes:
    # from sentence_transformers import SentenceTransformer, InputExample, losses, evaluation
    # from torch.utils.data import DataLoader
    # plus a load_data() helper that reads a CSV into a pandas DataFrame.

    # Load the base model
    logging.info(f"Loading base model: {model_name}")
    model = SentenceTransformer(model_name)
    
    # Set max sequence length
    model.max_seq_length = max_seq_length
    
    # Load training data
    logging.info("Loading training data")
    train_df = load_data(f"{data_path}/train.csv")
    val_df = load_data(f"{data_path}/val.csv")
    
    # Convert data to InputExamples
    train_examples = [
        InputExample(texts=[row['sentence1'], row['sentence2']], label=row['similarity']) 
        for _, row in train_df.iterrows()
    ]
    
    # Create data loader
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=train_batch_size)
    
    # Set up evaluator
    logging.info("Setting up evaluator")
    evaluator = evaluation.EmbeddingSimilarityEvaluator(
        sentences1=val_df['sentence1'].tolist(),
        sentences2=val_df['sentence2'].tolist(),
        scores=val_df['similarity'].tolist()
    )
    
    # Set up the loss
    train_loss = losses.CosineSimilarityLoss(model)
    
    # Calculate warmup steps
    warmup_steps = int(len(train_dataloader) * num_epochs * warmup_ratio)
    
    # Train the model
    logging.info(f"Beginning training for {num_epochs} epochs")
    logging.info(f"Beginning training for {output_path} epochs")
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        evaluator=evaluator,
        epochs=num_epochs,
        warmup_steps=warmup_steps,
        output_path=str(output_path),
        evaluation_steps=len(train_dataloader) // 2,  # Evaluate twice per epoch
        save_best_model=True
    )
    
    # Test on the original problematic pairs
    logging.info("Evaluating on test set")
    test_df = load_data(f"{data_path}/test.csv")
    
    # Load the best model
    best_model = SentenceTransformer(str(output_path))
    
    # Encode the sentences
    embeddings1 = best_model.encode(test_df['sentence1'].tolist())
    embeddings2 = best_model.encode(test_df['sentence2'].tolist())
    
    # Calculate cosine similarities
    from sklearn.metrics.pairwise import cosine_similarity
    similarities = []
    for i in range(len(embeddings1)):
        sim = cosine_similarity([embeddings1[i]], [embeddings2[i]])[0][0]
        similarities.append(sim)
    
    # Add to test_df
    test_df['predicted_similarity'] = similarities
    
    # Print results
    logging.info("Test results:")
    for _, row in test_df.iterrows():
        logging.info(f"Pair: '{row['sentence1']}' vs '{row['sentence2']}'")
        logging.info(f"Predicted similarity: {row['predicted_similarity']:.4f}")
        logging.info("-----")
    
    # Calculate average similarity
    avg_similarity = sum(similarities) / len(similarities)
    logging.info(f"Average similarity on test set: {avg_similarity:.4f}")
    
    # Save test results
    test_df.to_csv(f"{output_path}/test_results.csv", index=False)
    logging.info("Test results saved to test_results.csv")
    
    logging.info("Fine-tuning complete!")

Key Training Steps Explained

  1. Setup Environment: Configure logging, output directories, and verify GPU availability.

  2. Load Base Model: Import the pre-trained model from HuggingFace that will be fine-tuned.

  3. Prepare Data Loaders: Transform your CSV data into the format required by the SentenceTransformers library.

  4. Configure Training Components: Set up the loss function and evaluation metrics.

  5. Create Monitoring Callback: Implement visualization to track training progress (a minimal sketch follows this list).

  6. Execute Training Loop: Call model.fit() with all parameters to start the fine-tuning process.

  7. Save Training Configuration: Preserve hyperparameters for reproducibility.

  8. Evaluate on Test Set: Measure performance on unseen examples.

  9. Visualize Results: Create plots showing the correlation between expected and predicted similarities.
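As an illustration of step 5, here's a minimal sketch of a monitoring callback. The legacy model.fit() API accepts a callback that is invoked after each evaluation with the score, epoch, and step count; this version simply records the scores and refreshes a matplotlib plot (the training_progress.png mentioned below). Treat it as a sketch rather than the exact callback used in the repository.

import matplotlib
matplotlib.use("Agg")  # headless-safe backend for machines without a display
import matplotlib.pyplot as plt

eval_scores = []

def training_callback(score, epoch, steps):
    """Record each validation score and refresh the progress plot."""
    eval_scores.append({"epoch": epoch, "steps": steps, "score": score})
    plt.figure(figsize=(8, 4))
    plt.plot([s["score"] for s in eval_scores], marker="o")
    plt.xlabel("Evaluation round")
    plt.ylabel("Validation score")
    plt.title("Fine-tuning progress")
    plt.tight_layout()
    plt.savefig(str(output_path / "training_progress.png"))  # output_path as in the training script
    plt.close()

# Pass it to fit() alongside the other arguments:
# model.fit(..., evaluator=evaluator, callback=training_callback)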

Expected Outputs

After successful training, you'll find these files in your fine-tuned-semantic-model directory:

  • model.safetensors (or pytorch_model.bin, depending on library version): The fine-tuned model weights
  • config.json: Model architecture configuration
  • training_config.json: Training hyperparameters
  • training_progress.png: Visualization of training metrics
  • test_results.csv: Detailed evaluation on test set
  • test_results.png: Visualization of expected vs. predicted similarity
  • training.log: Complete training log

šŸ“Š Evaluating the Model

After fine-tuning, it's crucial to properly evaluate your model to understand how well it addresses the hallucination issues we've been discussing throughout this series. A thorough evaluation compares your fine-tuned model against the base model to quantify improvements.

Evaluation Strategy

Here's a comprehensive approach to evaluating your embedding model:

  1. Prepare Test Data: Ensure your test dataset includes examples that specifically target the hallucination types you're trying to fix (negation, capitalization, numeric differences, etc.)
  2. Load Models: Load both your fine-tuned model and the original base model for head-to-head comparison.
  3. Encode Test Sentences: Generate embeddings for each sentence pair in your test set with both models.
  4. Calculate Similarity Metrics: Measure cosine similarity between sentence pairs and compare to expected values.
  5. Analyze Performance by Category: Break down performance by different types of semantic challenges.
  6. Visualize Results: Create charts that clearly demonstrate improvements in specific areas.
  7. Run Statistical Tests: Determine if improvements are statistically significant.
  8. Document Findings: Create a comprehensive report of your evaluation results.

import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load both models
base_model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
fine_tuned_model = SentenceTransformer("./fine-tuned-model")

# Load test data
test_df = pd.read_csv("./data/test.csv")

# Test sentences focusing on negation examples
test_pairs = test_df[test_df['category'] == 'negation']

# Evaluate both models
results = []
for _, row in test_pairs.iterrows():
    # Get sentence pairs
    sent1, sent2 = row['sentence1'], row['sentence2']
    expected_sim = row['similarity']
    
    # Encode with base model
    base_emb1 = base_model.encode(sent1)
    base_emb2 = base_model.encode(sent2)
    base_sim = cosine_similarity([base_emb1], [base_emb2])[0][0]
    
    # Encode with fine-tuned model
    ft_emb1 = fine_tuned_model.encode(sent1)
    ft_emb2 = fine_tuned_model.encode(sent2)
    ft_sim = cosine_similarity([ft_emb1], [ft_emb2])[0][0]
    
    # Calculate errors
    base_error = abs(expected_sim - base_sim)
    ft_error = abs(expected_sim - ft_sim)
    
    results.append({
        'sentence1': sent1,
        'sentence2': sent2,
        'expected': expected_sim,
        'base_sim': base_sim,
        'ft_sim': ft_sim,
        'base_error': base_error,
        'ft_error': ft_error,
        'improvement': base_error - ft_error
    })

# Convert to DataFrame and analyze
results_df = pd.DataFrame(results)
print(f"Average base model error: {results_df['base_error'].mean():.4f}")
print(f"Average fine-tuned model error: {results_df['ft_error'].mean():.4f}")
print(f"Overall improvement: {results_df['improvement'].mean():.4f}")

Interpretation and Analysis

When analyzing the results, focus on these key aspects:

  • Overall Error Reduction: The primary metric is mean absolute error reduction. How much has the fine-tuned model improved over the baseline?
  • Category-Specific Improvements: Which types of hallucinations have been most successfully addressed (a per-category breakdown is sketched after this list)? For example:
    • Did negation handling improve significantly?
    • Are numeric magnitude differences better recognized?
    • Has capitalization sensitivity been fixed?
  • Error Distribution Changes: Has the distribution of errors changed? Are there fewer extreme errors?
  • Remaining Challenges: What types of examples still cause problems? These might need additional focused training.
  • Practical Impact: How do these improvements translate to real-world applications like retrieval or search?
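To produce that per-category breakdown, here's a minimal sketch. It assumes the evaluation loop above is run over the full test set rather than only the negation pairs, and that each row of results_df carries the category label from test_df:

import matplotlib.pyplot as plt
import seaborn as sns

# Mean absolute error per category, for both models
by_category = results_df.groupby("category")[["base_error", "ft_error"]].mean()
by_category["improvement"] = by_category["base_error"] - by_category["ft_error"]
print(by_category.sort_values("improvement", ascending=False))

# Bar chart comparing the two models across categories
plot_df = by_category.reset_index().melt(
    id_vars="category",
    value_vars=["base_error", "ft_error"],
    var_name="model",
    value_name="mean_abs_error",
)
plt.figure(figsize=(10, 5))
sns.barplot(data=plot_df, x="category", y="mean_abs_error", hue="model")
plt.title("Mean absolute error by hallucination category")
plt.tight_layout()
plt.savefig("category_comparison.png")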

What Just Happened: The Results

After fine-tuning, inference was run with both the fine-tuned model and the base model, and the resulting embedding similarities were compared.

Compare these results with the base model: for almost all the hallucination types, the fine-tuned model now scores more than 90% similarity.

For more details, please check the complete project on GitHub.

Conclusion: Beyond the Hallucinations

Throughout this series, we've taken a deep dive into the fundamental flaws of embedding models and their tendency to hallucinate by design. We've seen how models like MPNet, MS MARCO, and various OpenAI embeddings struggle with negation, capitalization, numeric differences, and temporal ordering - issues that undermine the reliability of AI applications built on these foundations.

The good news? Fine-tuning offers a practical path forward.

By carefully preparing targeted data that emphasizes the specific weaknesses we've identified, selecting the right base model, configuring appropriate hyperparameters, and implementing a robust training and evaluation process, we can dramatically reduce these hallucinations. The approach outlined in this article transforms generic embeddings into domain-specific tools that actually understand the nuances of your data.

Key Takeaways

  1. Embedding models aren't magic - they require thoughtful adaptation to your specific domain to be truly reliable.
  2. Data quality trumps quantity - carefully crafted examples that target specific weaknesses yield better results than massive generic datasets.
  3. Systematic evaluation is essential - comparing your fine-tuned model against the base model across different hallucination categories provides actionable insights.
  4. Fine-tuning is iterative - use your evaluation results to guide further refinements in your training data and process.
  5. The process is accessible - with tools like sentence-transformers, you don't need specialized ML expertise to implement these improvements.

Real-World Impact

The improvements from fine-tuning embedding models extend far beyond academic exercises. They directly enhance:

  • Retrieval accuracy in RAG systems

  • Semantic search quality in knowledge bases

  • Clustering precision in data analysis

  • Recommendation relevance in content systems

Each percentage point of improvement in embedding quality compounds throughout these systems, dramatically reducing hallucinations and improving user trust.

Next Steps

Ready to implement these techniques? The complete code from this article is available in my GitHub repository at https://github.com/ritesh-modi/embedding-hallucinations. There you'll find:

  • Complete training and evaluation scripts
  • Data for training, validation, and testing the base and fine-tuned models
  • Comparisons of embeddings across all hallucination types
  • Comprehensive documentation

Final Thoughts

As we build more AI systems on vector representations, the quality of our embeddings becomes increasingly crucial. The hallucinations we've discussed aren't just technical curiosities—they're practical barriers to reliable AI. By understanding these flaws and systematically addressing them through fine-tuning, we create embedding models that truly grasp meaning rather than merely approximating it.

This concludes our "Hallucination by Design" series. I hope these articles have given you both insight into embedding model limitations and practical tools to overcome them. The path to more reliable AI isn't through blind trust in foundation models, but through thoughtful adaptation of these models to our specific needs and domains.

Remember: vectors don't hallucinate, but models do. With the right approach to fine-tuning, we can build embedding models that see the world more clearly—one domain at a time.

For more details, please check the complete project on GitHub.


Written by riteshmodi | https://www.riteshmodi.com - Data Scientist, AI and blockchain expert with proven open-source solutions on MLOps, LLMOps and GenAIOps.
Published by HackerNoon on 2025/04/21