Multi-vector (late-interaction) retrieval represents each document (and query) as a variable-length set of embedding vectors rather than a single embedding vector.
Instead of reducing the input to a single vector via global pooling (e.g., mean pooling) or a dedicated summary token (e.g., CLS), multi-vector representations preserve token-, segment-, or patch-level vectors and score documents using a set-wise similarity function, most commonly MaxSim.
Why multi-vector retrieval is needed
Single-vector retrieval is the most common approach, but it can lose signal when:
- The document is long: pooling compresses many distinct concepts into one vector, potentially sacrificing important details in order to preserve overall meaning.
- The query has multiple aspects: a single vector tends to overemphasize features repeated across the document and suppress distinctive facets that appear only in localized context but are crucial for precision.
- Fine-grained matching matters: named entities, code identifiers, rare terms, or localized image regions are easily washed out in a single pooled vector, but remain directly matchable against token- or patch-level embeddings.
Multi-vector retrieval helps because it scores each query vector against its best-matching document vector, rather than collapsing everything into a single global similarity score.
Multi-vector (tensor) embeddings
Embedding models generally represent data internally as an N x D matrix:
- Token-level embeddings: each of N text tokens is represented by a D-dimensional vector.
- Patch / region embeddings (vision / multimodal): each of N patches or regions is represented by a D-dimensional vector.
- Segmented / multi-field encoders: multiple embeddings per input (e.g., each of N paragraphs or sections is represented by a D-dimensional vector).
Unlike traditional embedding models that pool these internal representations in the output layer to produce a single vector, late-interaction models such as ColBERTv2 (text) and ColPali (visual/multimodal) emit the full N x D matrix of token- or patch-level embeddings at the output layer (potentially projected to a lower dimension, quantized, or otherwise compressed), which is then stored in the database for retrieval.
MaxSim scoring
For a query Q represented by M vectors {q_1, ..., q_M} and a document D represented by N vectors {d_1, ..., d_N}, the MaxSim function computes the similarity between Q and D as
MaxSim(Q, D) = sum_{i=1}^{M} max_{1 ≤ j ≤ N} ⟨q_i, d_j⟩
In plain words: instead of computing one global similarity between pooled query and document embeddings, MaxSim compares each query vector to every document vector and keeps only the maximum similarity for each query vector. The final score is the sum (or average) of these per-query-vector maxima.
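As a concrete sketch, MaxSim can be computed directly in NumPy (the vectors below are toy values, not model outputs):

```python
import numpy as np

def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Sum, over all query vectors, of the max inner product with any document vector."""
    # sims[i, j] = <q_i, d_j>; shape (M, N)
    sims = query_vecs @ doc_vecs.T
    # For each query vector, keep its best-matching document vector, then sum.
    return float(sims.max(axis=1).sum())

Q = np.array([[1.0, 0.0], [0.0, 1.0]])              # M=2 query vectors
D = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]])  # N=3 document vectors
print(maxsim(Q, D))  # → 3.0  (1.0 from q_1·d_1 + 2.0 from q_2·d_3)
```

Note that a document vector may be the best match for several query vectors at once; MaxSim does not enforce a one-to-one alignment.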
With metric="maxsim", higher scores indicate better matches.
Define a schema for multi-vector retrieval
In TopK, multi-vector embeddings are stored in a matrix() field and indexed with a multi_vector_index() using the maxsim metric.
```python
from topk_sdk.schema import text, matrix, multi_vector_index

client.collections().create(
    "passages",
    schema={
        "content": text().required(),
        # Each row is one embedding vector; columns == dimension.
        "token_embeddings": matrix(dimension=128, value_type="f16").index(
            multi_vector_index(metric="maxsim")
        ),
    },
)
```
- dimension: the number of columns D in your N x D embedding matrix (e.g. 128, 768, 1024).
- value_type: storage type for matrix elements (f32, f16, f8, u8, i8). Choose based on your model output and memory/performance needs.
- metric="maxsim": late-interaction scoring where each query vector contributes based on its best match in the document.
- multi_vector_index() also accepts optional quantization ("1bit", "2bit", "scalar"), width, and top_k for tuning; see Optimization tips below.
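As a hedged sketch, here is the same schema with the optional tuning knobs spelled out (the parameter values are illustrative starting points, not recommendations):

```python
from topk_sdk.schema import text, matrix, multi_vector_index

schema = {
    "content": text().required(),
    "token_embeddings": matrix(dimension=128, value_type="f16").index(
        multi_vector_index(
            metric="maxsim",
            quantization="2bit",  # "1bit" | "2bit" | "scalar": compress stored vectors
            width=128,            # sparse-projection width for approximate pruning
            top_k=16,             # top projected values kept in the approximate stage
        )
    ),
}
```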
Ingest documents with multi-vector embeddings
When upserting documents, include both:
- A multi-vector embedding for the document's content
- Any relevant document metadata
Each document must also include a required _id field.
```python
import numpy as np
from topk_sdk.data import matrix

# Example: shape (num_vectors, dimension)
token_embeddings = np.random.randn(12, 128).astype(np.float16)

client.collection("passages").upsert(
    [
        {
            "_id": "p1",
            "content": "Late interaction retrieval",
            # `numpy.ndarray` is supported out of the box
            "token_embeddings": token_embeddings,
        },
        {
            "_id": "p2",
            "content": "MaxSim in practice",
            # Or use the matrix data constructor via `topk_sdk.data.matrix(...)`
            "token_embeddings": matrix(token_embeddings, value_type="f16"),
        },
    ]
)
```
TopK has built-in support for numpy.ndarray when ingesting and querying multi-vector (matrix) embeddings. If you pass an ndarray directly, the matrix value type is inferred from its dtype (e.g. float32, float16, uint8, int8).
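The dtype-based inference can be pictured roughly as follows; the mapping below is an illustrative sketch mirroring the dtypes listed above, not the SDK's actual implementation:

```python
import numpy as np

# Hypothetical dtype -> value_type mapping (for illustration only; the real
# inference rules live inside the TopK SDK).
DTYPE_TO_VALUE_TYPE = {
    np.dtype(np.float32): "f32",
    np.dtype(np.float16): "f16",
    np.dtype(np.uint8): "u8",
    np.dtype(np.int8): "i8",
}

emb = np.random.randn(12, 128).astype(np.float16)
print(DTYPE_TO_VALUE_TYPE[emb.dtype])  # → f16
```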
Run multi-vector retrieval
Use fn.multi_vector_distance() to score documents against a query matrix.
When the field is indexed with metric="maxsim", this computes the MaxSim score defined above.
```python
import numpy as np
from topk_sdk.query import field, fn, select

query_matrix = np.random.randn(8, 128).astype(np.float16)

docs = client.collection("passages").query(
    select(
        "content",
        maxsim=fn.multi_vector_distance(
            "token_embeddings",
            query_matrix,    # `numpy.ndarray` is supported in queries as well
            candidates=200,  # optional: tune performance vs. recall
        ),
    ).topk(field("maxsim"), 10)
)

print(docs)
```
```python
# Results:
[
    {
        "_id": "p2",
        "content": "MaxSim in practice",
        "maxsim": 0.83,
    },
    {
        "_id": "p1",
        "content": "Late interaction retrieval",
        "maxsim": 0.79,
    },
]
```
Optimization tips
The multi-vector index has an approximate retrieval stage (for faster pruning) followed by a more accurate scoring stage. These parameters let you trade off memory, latency, and recall:
- width: width of the sparse projection used for approximate MaxSim pruning.
  - Higher width: more accurate pruning → fewer false negatives (better recall), but higher memory/compute.
  - Lower width: more aggressive approximation → more false negatives (recall loss), but faster and smaller.
- top_k: number of top projected values to keep during the approximate stage.
- quantization: compresses stored multi-vector values.
  - 1bit / 2bit: very compact and fast, but most approximate.
  - scalar: higher-fidelity quantization (larger than 1–2 bit) with better quality.
- candidates: (query parameter) controls how many top document candidates from the approximate stage are promoted to the more accurate multi-vector scoring pass. Lower values reduce work (faster/cheaper) but can hurt recall; higher values improve recall at the cost of latency.
If you're starting out, keep the defaults, then tune candidates at query time and the index parameters (width, top_k, quantization) based on your latency and recall targets.