When we humans read a text, we are not just reading the characters that make up the text. Instead, we map the words to a mental model of the world. For example, when you’re reading the word warm, you have an idea of what warm represents. You know how it relates to other concepts such as temperature, cold, or lava. One attempt at capturing these attributes is to create a numerical embedding of a text. We will not look into how these numerical embeddings are computed1. Instead, we will explore the embeddings themselves.
Given some string input like warm, an embedding model will transform it into a vector. The exact interpretation of each dimension of the vector is unknown to us. However, we can compare vectors to each other. The established way to compare two embedding vectors is to use the cosine similarity2. So, for example, we can compare the vector of warm to the vector of hot, and we would expect them to be somewhat similar.
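Concretely, the cosine similarity of two vectors is their dot product divided by the product of their lengths, i.e. the cosine of the angle between them. To keep the snippets below reproducible, here is a minimal setup with a small `cos` helper and a loaded model; the model name is only a placeholder, any sentence embedding model will do:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder model: swap in whichever embedding model you want to experiment with.
model = SentenceTransformer("all-MiniLM-L6-v2")

def cos(a, b):
    """Cosine similarity: the dot product of a and b divided by the product of their norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

With this in place, comparing warm and hot looks like this: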
```python
emb_warm = model.encode("warm")
emb_hot = model.encode("hot")
sim = cos(emb_warm, emb_hot)
print(sim)  # -> 0.8
```
Okay, wonderful! We have an embedding model that can now compare words and provide a measure of similarity. But wait a minute… we all agree that the words warm and hot are similar. But what about words like warm and cold? Are they similar because they are related, or are they different because they express opposite sensations? Both perspectives are valid, so let’s explore each of them. To do this, we will assign pairwise similarity values to the words warm, hot, cold, and philosophy.
Let’s start with the sensation-based approach. The lowest cosine similarity score is -1, and the highest is 1. So, we might assign warm and cold a low score, perhaps -0.7. We place the unrelated word philosophy perfectly in the middle, at 0.0, and the rest would probably look something like this:
|            | Warm | Hot  | Cold | Philosophy |
|------------|------|------|------|------------|
| Warm       | 1.0  |      |      |            |
| Hot        | 0.8  | 1.0  |      |            |
| Cold       | -0.7 | -0.8 | 1.0  |            |
| Philosophy | 0.0  | 0.0  | 0.0  | 1.0        |
If we instead had to assign pairwise scores based on whether the words are related or not, we would expect cold to be more related to words like warm and hot than to philosophy:
|            | Warm | Hot | Cold | Philosophy |
|------------|------|-----|------|------------|
| Warm       | 1.0  |     |      |            |
| Hot        | 0.9  | 1.0 |      |            |
| Cold       | 0.7  | 0.8 | 1.0  |            |
| Philosophy | 0.0  | 0.0 | 0.0  | 1.0        |
The difficulty with “semantic similarity” is that it is often used without a clear definition and varies with context. It can range from detecting synonyms (disappear—vanish), identifying texts with identical meanings, recognizing related words (coffee—cup), grouping similar concepts (cat—dog), to associating topics…
However, if we only provide our poor embedding model with the words, it’s impossible to determine whether we want a “sensation-based” approach or a “relation-based” approach. If only there were a way to instruct the model on what we actually want to compare.
Su et al., 2022 proposed adding another input to the embedding model. This second input allows us to specify the type of similarity we want to measure. We can instruct the model on what is important in our specific context. For our previous example, it might look something like this:
```python
# The instruction is passed as a second input; with sentence-transformers
# this is done via the `prompt` argument, which prepends it to the text.
instruction = "Classify a given temperature by its sensation."

emb_warm = model.encode("warm", prompt=instruction)
emb_cold = model.encode("cold", prompt=instruction)
sim = cos(emb_warm, emb_cold)
print(sim)  # -> -0.7 (the score we would hope for)
```
Even though we embedded the same words as before, we now expect to get a completely different score (e.g., -0.7 instead of 0.8).
If they work, instructions are a great way to diversify what we can do with a model. Unfortunately, in my experience, most embedding models struggle with instructions and settings that deviate from their training examples. So as long as your particular task is covered by existing benchmarks, you’re good to go. However, the example of “scoring temperature based on its sensation” is quite unconventional. I assume this sort of scoring hasn’t received much attention during instruction fine-tuning. Or perhaps that’s just an excuse… Whatever the reason, I have struggled to recreate our intuitive scores from the table above with most models. In fact, for many models, using instructions outside of the pretrained examples barely affects the similarity values. Some models, especially smaller ones3, intentionally limit themselves to either no instructions at all or just two instructions. It seems they might be optimizing for existing benchmarks within a fixed parameter budget.
Over the past year, I have experimented with various instruction models and have been consistently underwhelmed by the impact of the instructions. However, the current top-rated open-source model has been quite impressive.
In fact, it has been the only model I’ve managed to convince that, in a certain context, warm—cold might be less similar than warm—philosophy. Using the following instructions, we get:
No Instr: no instruction at all.
Sensation: “Classify a given temperature by its sensation.”
| Word Pair         | No Instr | Sensation |
|-------------------|----------|-----------|
| warm - warm       | 1.00     | 1.00      |
| warm - hot        | 0.89     | 0.92      |
| warm - cold       | 0.83     | 0.820     |
| warm - philosophy | 0.67     | 0.841     |
We can see that using no instruction basically already gives us the relational picture, so there is no point in creating a separate instruction for that. The fact that we managed to push the model to score warm–philosophy higher than it scores warm–cold is truly impressive.
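For reference, here is roughly how such a comparison can be scripted. The model name is a placeholder, and I use sentence-transformers’ generic `prompt` argument to prepend the instruction; the exact format a given model expects differs, so check its documentation:

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder: use an instruction-tuned embedding model of your choice.
model = SentenceTransformer("your-instruct-embedding-model")

words = ["warm", "hot", "cold", "philosophy"]
instruction = "Classify a given temperature by its sensation. "

# Encode each word once without and once with the instruction prepended.
plain = {w: model.encode(w) for w in words}
instructed = {w: model.encode(w, prompt=instruction) for w in words}

for other in words:
    no_instr = util.cos_sim(plain["warm"], plain[other]).item()
    sensation = util.cos_sim(instructed["warm"], instructed[other]).item()
    print(f"warm - {other}: {no_instr:.2f} | {sensation:.2f}")
```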
If you’ve had your own experiences with instruction and embedding models, I’d love to hear your thoughts in the comments on Reddit.
We’ve already seen how the instruction changes the similarity score. But we can also visualize how the embeddings shift in vector space by plotting them. One of the most distinguishing features of a text is its language. We can determine the directions with the most variation using Singular Value Decomposition (SVD); in other words, these are the directions in the space that have the most influence on our similarity metric. First we need some embeddings, so let’s take some German and English sentences that are translation pairs of each other. Each sentence pair should be close in meaning since it is a translation. Let’s embed them without instructions first. Projecting our high-dimensional embeddings onto the two most influential directions gives us the following two-dimensional plot:
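A quick sketch of how such a projection can be computed; the sentence pairs and the model name below are only placeholders:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("your-embedding-model")  # placeholder name

# A few English sentences and their German translations (illustrative only).
english = ["The weather is warm today.", "I am reading a book.", "The coffee is cold."]
german = ["Das Wetter ist heute warm.", "Ich lese ein Buch.", "Der Kaffee ist kalt."]

emb = model.encode(english + german)   # shape: (n_sentences, dim)
centered = emb - emb.mean(axis=0)      # center before the SVD

# Rows of vt are the directions of largest variation in embedding space.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
proj = centered @ vt[:2].T             # project onto the top two directions

for sentence, (x, y) in zip(english + german, proj):
    print(f"{x:+.2f} {y:+.2f}  {sentence}")
```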
A very clear distinction between the languages becomes immediately apparent. While this makes it easy to classify texts by language, it’s not particularly useful if we want to find translation pairs. By the way, I’ve tried this with several embedding models, and all of them show a clear distinction between languages on the first two principal components.
The real fun part starts when we now add the instruction “Retrieve semantically similar text between German and English translations.”
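Continuing the projection sketch from above, the only change is to pass the instruction when encoding, again via the `prompt` argument as a stand-in for whatever mechanism your model uses:

```python
instruction = "Retrieve semantically similar text between German and English translations. "
emb = model.encode(english + german, prompt=instruction)
# ...then center, run the SVD, and project exactly as before.
```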
Well, I think the figure says it all. It is genuinely impressive how much the vector space has shifted based on our instruction. Even just in the first two principal components we can already see that finding translation pairs is going to be much easier.
I hope this was somewhat interesting to read, and I would love to get some feedback or hear about your experiences with instruction models in the comments on Reddit. The comment thread will be up on Reddit later.
1. A good place to start is the Stanford NLP course, or specifically the chapter about vector embeddings. ↩
2. For the first embedding models like word2vec it was not obvious that the cosine similarity should be used. However, nowadays many embedding models are specifically trained to be used with cosine similarity. ↩
3. Less than 3B parameters on MTEB. As of Jan 2025: dunzhang/stella_en_1.5B_v5, dunzhang/stella_en_400M_v5, Alibaba-NLP/gte-large-en-v1.5, jinaai/jina-embeddings-v3 ↩
I tried using the same instructions across different models for way too long… but every model needs different instructions! In order to write good instructions for a model, we need to understand what instructions were used during its training and how they are structured. I find it easiest to look up the instructions a model used to compete on the MTEB.
For example, during retrieval tasks, NV-Embed-V2 uses no instruction at all for documents and only adds instructions to the queries. Some models use different instructions for the documents and the queries. For classification tasks, it seems that most models use the same instruction for all samples. I believe it is still beneficial to stay close to the wording used in a model’s own instructions.
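To make this concrete, here is the kind of query-side template several MTEB retrieval models document. The wording below follows the common “Instruct: … Query: …” style, but you should copy the exact template from the model card of the model you actually use:

```python
def format_query(task_description: str, query: str) -> str:
    """Query-side instruction template in the style used by several MTEB retrieval models.
    The exact wording and separators differ per model, so check the model card."""
    return f"Instruct: {task_description}\nQuery: {query}"

# Example task description in the style found on MTEB model cards.
task = "Given a web search query, retrieve relevant passages that answer the query"
query = format_query(task, "how warm is lava?")

# Documents are typically embedded as-is, without any instruction.
document = "Lava temperatures range from roughly 800 to 1200 degrees Celsius."
```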