This guide walks you through the theory and practice of fine-tuning embedding models using EmbeddingGemma as an example. If you’d rather skip the explanations and jump straight into code, I’ve prepared a Colab notebook you can run yourself — it’s well-documented and takes about 10-20 minutes with GPU enabled.
Still here? Great, let’s start with the basics.
What is fine-tuning and why bother?
Pre-trained models are generalists. They’ve seen billions of words and learned what “similar” means across the entire internet. That’s impressive — but it’s also the problem. Your domain has its own vocabulary, its own acronyms, its own meaning of words that the rest of the world uses differently.
Fine-tuning is the process of taking a pre-trained model and teaching it the nuances of your specific world. Instead of training from scratch (expensive, slow, requires massive data), you start with a model that already understands language and nudge it toward your use case. Think of it as hiring someone with great general skills and then onboarding them to your company’s way of doing things.
Let’s redefine a well-known abbreviation: the meaning of “LLM”
For embedding models specifically, fine-tuning adjusts what the model considers “similar.” Let me give you an example we’ll use throughout this guide: imagine your company’s Project Management Office has been using the acronym “LLM” to mean “Lesson Learned Meeting” since 2005 — long before the AI world claimed those three letters. In that company, when someone searches the internal knowledge base for “LLM requirements,” they want meeting templates and scheduling guidelines, not GPU clusters and transformer architectures.
Out of the box, an embedding model has no idea your PMO exists. It thinks “LLM” is closest to “Gen AI” and “neural network.” After fine-tuning on your domain data, it learns that “LLM” actually belongs near “retrospective” and “meeting agenda.” Same abbreviation, completely different neighborhood.
What are embedding models?
Embedding models convert text into numbers — specifically, into lists of numbers called vectors. Why? Because computers can’t measure how similar two sentences are by reading them, but they can easily calculate the distance between two points in space.
Think of it as placing words on a map. Picture a simplified version: fruits cluster together, tools cluster together, and “Galaxy” sits alone, belonging to neither group. When we add “Banana,” it lands close to the other fruits. In reality, embedding models work in hundreds or thousands of dimensions (not just X and Y), but the principle is the same: similar concepts end up close to each other.
This is what powers semantic search. When you search for “tropical fruit,” the system doesn’t just look for those exact words — it finds documents near that region of the embedding space, which might include “banana” or “mango” even if they don’t contain “tropical” anywhere in the text.
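The “distance between points” idea can be sketched with cosine similarity on toy 2-D vectors. The coordinates below are made up purely for illustration; a real model produces hundreds of dimensions per text:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-D "embeddings" (made-up coordinates, not real model output)
embeddings = {
    "banana": np.array([0.90, 0.10]),
    "mango":  np.array([0.85, 0.20]),
    "hammer": np.array([0.10, 0.95]),
}

query = embeddings["banana"]
scores = {word: cosine_similarity(query, vec) for word, vec in embeddings.items()}
# "mango" scores far higher than "hammer": the fruits sit near each other on the map
```

The same comparison drives semantic search: embed the query, embed the documents, and rank by this score.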
Why fine-tune an embedding model?
Off-the-shelf embedding models are trained on general data — Wikipedia, books, web pages. They work surprisingly well for most tasks. But “surprisingly well” isn’t always good enough. Here’s when fine-tuning pays off:
Domain-specific vocabulary. Your industry has jargon, acronyms, and terms that mean something different in your context. Medical, legal, financial, or engineering domains all have words that a general model will misinterpret — or in our case, “LLM” meaning a meeting, not a neural network.
Better RAG retrieval. If your Retrieval-Augmented Generation system keeps pulling irrelevant documents, the problem often isn’t the LLM doing the answering — it’s the embedding model doing the searching. Fine-tuning helps it understand what “relevant” means for your specific knowledge base.
Classification and clustering. When you need to group support tickets, categorize documents, or detect similar items, a fine-tuned model will draw boundaries that match your business logic, not the internet’s general understanding.
Multilingual edge cases. Sometimes the same term appears across languages but means different things in your domain. Fine-tuning can teach these distinctions.
What is EmbeddingGemma?
EmbeddingGemma is Google’s lightweight embedding model, released in September 2025. At just 308 million parameters, it’s tiny — small enough to run on mobile phones, laptops, or a free Colab notebook, using under 200MB of RAM when quantized.
What makes it interesting is its architecture. Unlike older embedding models based on BERT, EmbeddingGemma is built on top of Gemma 3 — Google’s open-weight large language model. This has a practical advantage: it was trained on over 100 languages, so the embedding space is shared across them.
Why does that matter for fine-tuning? In our experiment, we trained the model entirely on English examples. But when we tested it with Polish queries, the improvements transferred automatically. The token “LLM” is identical in both languages and acts as an anchor — once the model learns that “LLM” belongs near “meeting” and “retrospective” in English, Polish queries benefit too.
EmbeddingGemma also uses task-specific prompts to optimize embeddings for different use cases — retrieval, classification, clustering, and more. You’ll see this in action when we get to the code.
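Under the hood, these task-specific prompts are just short text prefixes prepended to the input before encoding. The exact prefix strings below follow EmbeddingGemma’s documented format, but treat them as an assumption and check `model.prompts` on the loaded model to confirm:

```python
# Assumed EmbeddingGemma prompt prefixes -- verify against model.prompts
PROMPTS = {
    "query":    "task: search result | query: ",
    "document": "title: none | text: ",
}

def apply_prompt(text: str, task: str) -> str:
    """Prepend the task-specific prefix, as the library does internally."""
    return PROMPTS[task] + text

print(apply_prompt("LLM requirements", "query"))
# prints: task: search result | query: LLM requirements
```

With Sentence Transformers, you get the same effect by calling `model.encode(text, prompt_name="query")` instead of prefixing by hand.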
Fine-tuning an embedding model is driven by contrastive examples, most commonly triplets: an anchor (a query), a positive (text that should sit close to it), and a negative (text that should sit far away). Here’s one from our PMO scenario:
Anchor: "Who facilitates the LLM?"
Positive: "The Scrum Master or an external neutral party should guide the retrospective discussion."
Negative: "The transformer architecture relies on self-attention mechanisms to process text."
See what’s happening? The anchor mentions “LLM,” and we’re telling the model: the correct answer is about meeting facilitation, not transformer architecture. Repeat this pattern multiple times with variations, and the model starts to get it.
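The loss we’ll use later, `MultipleNegativesRankingLoss`, turns each triplet into exactly this kind of contest: it scores the anchor against the positive and the negatives and applies a softmax cross-entropy so the positive has to win. A simplified single-anchor numpy sketch of that objective (the real loss also treats other in-batch positives as extra negatives):

```python
import numpy as np

def mnrl_loss(anchor, positive, negatives, scale=20.0):
    """Softmax cross-entropy over cosine similarities: the positive should
    out-score every negative. Simplified sketch of the idea behind
    MultipleNegativesRankingLoss; scale=20.0 matches the library default."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    sims = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives])
    logits = scale * sims                       # temperature scaling
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(probs[0]))             # positive sits at index 0

# Toy vectors: loss is tiny when the anchor already points toward the positive
a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])   # near the anchor -> low loss
n = np.array([0.0, 1.0])   # far from the anchor
print(mnrl_loss(a, p, [n]))
```

Training nudges the embeddings so this loss shrinks; after enough triplets, “LLM” drifts toward the meeting cluster.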
Sentence Transformers
For the actual training, we’ll use Sentence Transformers — a Python library that makes fine-tuning embedding models surprisingly painless. It handles the training loop, loss functions, and prompt management for us.
The key pieces:
```python
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Load the pre-trained checkpoint as the starting point
model = SentenceTransformer("google/embeddinggemma-300m")

# Contrastive loss: pull each anchor toward its positive, away from negatives
loss = MultipleNegativesRankingLoss(model)
```
The training data
I created about 65 triplets with help from Gemini. The key rule: none of these examples overlap with the test data. We want the model to learn what “LLM” means in our domain — not memorize specific answers.
All anchors are questions about “LLM” (meetings). All positives are PM-related content. All negatives are AI-related content. This teaches a clear boundary.
The full dataset and training code are in the Colab notebook — training takes about 2 minutes with GPU, around 10 minutes on CPU.
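A sketch of how such a triplet dataset might be assembled, with a sanity check that no training anchor leaks into the evaluation query. The strings here are illustrative stand-ins, not the actual 65 triplets:

```python
# Hypothetical triplets in the anchor/positive/negative format shown earlier.
# These are illustrative stand-ins, not the real training set.
triplets = [
    {
        "anchor":   "Who facilitates the LLM?",
        "positive": "The Scrum Master or an external neutral party guides the retrospective.",
        "negative": "The transformer architecture relies on self-attention mechanisms.",
    },
    {
        "anchor":   "When should the LLM take place?",
        "positive": "Hold the lessons learned session once the project formally closes.",
        "negative": "Large language models are trained with next-token prediction.",
    },
]

# Guard against train/test overlap: otherwise the test measures memorization,
# not what the model learned about "LLM" in our domain.
test_query = "What are the requirements for LLMs?"
assert all(t["anchor"] != test_query for t in triplets)
```

From here, something like `Dataset.from_list(triplets)` (from the `datasets` library) produces the columnar dataset that `SentenceTransformerTrainer` expects alongside the loss.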
Results and honest afterthoughts
The test setup
To measure whether fine-tuning worked, we need a clear test. Here’s what we used:
Query: “What are the requirements for LLMs?”
Documents (three about AI, three about meetings):
#A Large Language Models require massive GPU clusters for training.
#A Next-token prediction is the core task of large language models.
#A Context length limits how much information the model can process.
#B Schedule the lesson learned meeting within 10 days of closure.
#B Use the retrospective template from SharePoint for the meeting agenda.
#B All core team members must attend the meeting to discuss what went well.
In our PMO world, the correct answers are the #B documents. Let’s see how the vanilla model handles this.
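The evaluation itself is just “embed everything, score, sort.” A sketch of that ranking step, with toy vectors standing in for real `model.encode()` output:

```python
import numpy as np

def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted by cosine similarity to the query,
    best first, along with the raw scores."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    scores = [float(cos(query_vec, d)) for d in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy vectors in place of real embeddings (made up for illustration)
query = np.array([1.0, 0.2])
docs = [
    np.array([0.95, 0.25]),  # close to the query -> should rank first
    np.array([0.10, 1.00]),  # far from the query
]
order, scores = rank_documents(query, docs)
# order[0] == 0: the nearby document wins
```

With the real model, `query_vec` and each entry of `doc_vecs` come from encoding the query and the six test documents; the sorting logic is identical before and after fine-tuning.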
Before: the vanilla model gets it wrong
No surprises here — the vanilla EmbeddingGemma ranks all AI documents at the top:
| Rank | Document | Score |
|---|---|---|
| 1 | #A Large Language Models require massive GPU… | 0.35 |
| 2 | #A Next-token prediction is the core task… | 0.19 |
| 3 | #A Context length limits… | 0.16 |
| 4 | #B All core team members must attend… | 0.15 |
| 5 | #B Schedule the lesson learned meeting… | 0.10 |
| 6 | #B Use the retrospective template… | 0.05 |
The model has no idea our PMO exists. It thinks “LLM” means what the rest of the internet thinks it means.
After: it worked
After two epochs of training on our 65 triplets, the ranking flipped completely:
| Rank | Document | Score |
|---|---|---|
| 1 | #B Use the retrospective template… | 0.95 |
| 2 | #B Schedule the lesson learned meeting… | 0.95 |
| 3 | #B All core team members must attend… | 0.94 |
| 4 | #A Next-token prediction is the core task… | -0.49 |
| 5 | #A Large Language Models require massive GPU… | -0.54 |
| 6 | #A Context length limits… | -0.55 |
The PM documents jumped to the top with near-perfect scores. The AI documents didn’t just drop — they went negative. The model now actively pushes them away.
Cross-lingual transfer: a free bonus
Here’s something we didn’t train for. When tested with Polish queries like “co to jest LLM?” (what is LLM?), the improvements transferred automatically.
Why? EmbeddingGemma’s embedding space is shared across languages, and the token “LLM” is identical in Polish and English. It acts as an anchor — once the model learns “LLM” belongs near “meeting” in English, Polish queries benefit too. We fine-tuned in one language and got improvements in another for free.
Training loss hit zero — should we worry?
The training loss dropped to 0.000000 after the first epoch. That’s a classic overfitting signal — the model memorized all 65 examples perfectly.
In our case, it still generalized to unseen test data, so we’re fine. But for production, you’d want a validation split to monitor generalization, more training examples, and possibly early stopping.
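The validation split mentioned above can be as simple as holding out a slice of the triplets before training. A sketch (the 20% ratio and seed are arbitrary choices):

```python
import random

def train_val_split(triplets, val_fraction=0.2, seed=42):
    """Shuffle deterministically, then hold out a fraction for validation."""
    items = list(triplets)
    random.Random(seed).shuffle(items)
    n_val = max(1, int(len(items) * val_fraction))
    return items[n_val:], items[:n_val]

# Placeholder triplets standing in for the 65 real ones
triplets = [{"anchor": f"q{i}", "positive": f"p{i}", "negative": f"n{i}"}
            for i in range(65)]
train, val = train_val_split(triplets)
# 65 triplets -> 52 for training, 13 held out for validation
```

The held-out slice would then go to the trainer as its evaluation dataset, so you can watch the eval loss instead of a training loss that flatlines at zero.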
The bias we baked in
There’s a subtle issue. In the notebook, I ran a sanity check asking “What is a CPU?” The model still ranks correct answers at the top — but it scores a random PM sentence higher than AI content, even though neither is relevant to CPUs.
Why? Every positive in our training was PM content. Every negative was AI content. The model didn’t just learn “LLM = meeting” — it partially learned “PM = relevant, AI = irrelevant.”
For our specific use case (a PMO knowledge base), that’s fine. But if your domain covers both topics, you’d need more balanced negatives. Fine-tuning updates shared weights — every bias you train affects the whole embedding space.
Wrapping up
We took a general-purpose embedding model and taught it that “LLM” means something completely different in our world — with just 65 examples and a few minutes of training. The process isn’t magic: define what “similar” means in your domain, create triplets that teach that definition, and let the model adjust.
The full code is in the Colab notebook — fork it, swap in your own domain vocabulary, and see what happens.

