Multi-vector (late-interaction) retrieval represents each document (and query) as a variable-length set of embedding vectors rather than a single embedding vector. Instead of reducing the input to a single vector via global pooling (e.g., mean pooling) or a dedicated summary token (e.g., CLS), multi-vector representations preserve token-, segment-, or patch-level vectors and score documents using a set-wise similarity function, most commonly MaxSim.
Why multi-vector retrieval is needed
Single-vector retrieval is the most common approach, but it can lose signal when:
- The document is long: pooling compresses many distinct concepts into one vector, potentially sacrificing important details in order to preserve overall meaning.
- The query has multiple aspects: a single vector tends to overemphasize features repeated across the document and suppress distinctive facets that appear only in localized context but are crucial for precision.
- Fine-grained matching matters: named entities, code identifiers, rare terms, or localized image regions are easily lost in a single pooled vector but remain directly matchable through token- or patch-level embeddings.
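The dilution effect described in the list above can be seen in a toy example. This is a minimal sketch with made-up 2-dimensional vectors, not output from any real embedding model: nine document tokens point in one direction, one token points in another, and the query asks about the rare one.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(vectors):
    """Average a list of vectors into a single vector (global mean pooling)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Hypothetical token vectors: nine tokens on a common topic, one on a rare topic.
doc_tokens = [[1.0, 0.0]] * 9 + [[0.0, 1.0]]
query = [0.0, 1.0]  # the query targets the rare topic

# Single-vector view: the rare token is averaged away.
pooled_score = dot(query, mean_pool(doc_tokens))

# Token-level view: the best-matching token is still fully visible.
best_token_score = max(dot(query, token) for token in doc_tokens)

print(pooled_score, best_token_score)
```

The pooled score is a tenth of the best token-level score here, because the one relevant token is outvoted by the nine others.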
Multi-vector (tensor) embeddings
Embedding models generally represent data internally as an N x D matrix:
- Token-level embeddings: each of N text tokens is represented by a D-dimensional vector.
- Patch / region embeddings (vision / multimodal): each of N patches or regions is represented by a D-dimensional vector.
- Segmented / multi-field encoders: multiple embeddings per input (e.g., each of N paragraphs or sections is represented by a D-dimensional vector).
ColBERTv2 (text) and ColPali (visual/multimodal) produce a full N x D matrix of token- or patch-level embeddings at the output layer (potentially projected to a lower dimension, quantized, or otherwise compressed), which is then stored in the database for retrieval.
MaxSim scoring
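For a query Q represented by M vectors {q_1,...,q_M} and a document D represented by N vectors {d_1,...,d_N}, all of the same embedding dimension, the MaxSim function computes the similarity between Q and D as
$$
\mathrm{MaxSim}(Q, D) = \sum_{i=1}^{M} \max_{1 \le j \le N} \; q_i \cdot d_j
$$
where q_i · d_j is the similarity (typically the dot product) between one query vector and one document vector. With metric="maxsim", higher scores indicate better matches.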
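MaxSim sums, over the query vectors, each one's best match among the document vectors. A minimal pure-Python sketch, assuming dot-product similarity between individual vectors and using made-up 2-dimensional embeddings (the function name `maxsim` is illustrative, not a TopK API):

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vectors, doc_vectors):
    """Sum over query vectors of the best dot product with any document vector."""
    return sum(
        max(dot(q, d) for d in doc_vectors)
        for q in query_vectors
    )


# Hypothetical embeddings: M = 2 query vectors, documents of different lengths.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]  # N = 3, covers both query aspects
doc_b = [[0.9, 0.1], [0.6, 0.4]]              # N = 2, covers only the first aspect

score_a = maxsim(query, doc_a)  # best matches: 0.9 and 0.8
score_b = maxsim(query, doc_b)  # best matches: 0.9 and 0.4

print(score_a > score_b)
```

Documents of different lengths can be compared directly, since each query vector contributes exactly one term to the sum regardless of N.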
Define a schema for multi-vector retrieval
In TopK, multi-vector embeddings are stored in a matrix() field and indexed with a multi_vector_index() using the maxsim metric.
- dimension: the number of columns D in your N x D embedding matrix (e.g. 128, 768, 1024).
- value_type: storage type for matrix elements (f32, f16, f8, u8, i8). Choose based on your model output and memory/perf needs.
- metric="maxsim": a late-interaction style scoring where each query vector contributes based on its best match in the document.
multi_vector_index() also accepts optional quantization ("1bit", "2bit", "scalar"), width, and top_k for tuning; see Optimization tips below.
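To get a feel for how value_type and dimension affect storage, the raw size of one N x D matrix can be estimated as N × D × bytes per element. The sketch below is back-of-the-envelope arithmetic only: it assumes the usual element widths (4 bytes for f32, 2 for f16, 1 for the 8-bit types), and ignores index structures, quantization, and any other overhead, so it says nothing about TopK's actual on-disk footprint. The helper name and the 300-token document are hypothetical.

```python
# Assumed element widths in bytes for each storage type.
BYTES_PER_ELEMENT = {"f32": 4, "f16": 2, "f8": 1, "u8": 1, "i8": 1}


def matrix_bytes(num_vectors, dimension, value_type):
    """Raw size in bytes of an N x D matrix with the given element type."""
    return num_vectors * dimension * BYTES_PER_ELEMENT[value_type]


# Hypothetical document: 300 token vectors of dimension 128.
for value_type in ("f32", "f16", "i8"):
    print(value_type, matrix_bytes(300, 128, value_type))
```

Because the size scales with N as well as D, long documents cost proportionally more than short ones, unlike single-vector storage.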
Ingest documents with multi-vector embeddings
When upserting documents, include both:
- A multi-vector embedding for the given content
- Any relevant document metadata
Run multi-vector retrieval
Use fn.multi_vector_distance() to score documents against a query matrix.
When the field is indexed with metric="maxsim", this computes the MaxSim score defined above.
Optimization tips
The multi-vector index has an approximate retrieval stage (for faster pruning) followed by a more accurate scoring stage. These parameters let you trade off memory, latency, and recall:
- width: width of the sparse projection used for approximate MaxSim pruning.
  - Higher width: more accurate pruning → fewer false negatives (better recall), but higher memory/compute.
  - Lower width: more aggressive approximation → more false negatives (recall loss), but faster and smaller.
- top_k: number of top projected values to keep during the approximate stage.
- quantization: compresses stored multi-vector values.
  - 1bit / 2bit: very compact and fast, but most approximate.
  - scalar: higher-fidelity quantization (larger than 1–2 bit) with better quality.
- candidates: (query parameter) controls how many top document candidates from the approximate stage are promoted to a more accurate multi-vector scoring pass. Lower values reduce work (faster/cheaper) but can hurt recall; higher values improve recall at the cost of latency.
If you’re starting out, keep defaults, then tune candidates and these index parameters (width, top_k, quantization) based on your latency or recall targets.
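The effect of candidates on recall can be illustrated with a generic two-stage search. This is only a conceptual sketch, not TopK's implementation: the approximate stage here uses crude 1-bit sign quantization as a stand-in for the sparse projection described above, and all names (`two_stage_search`, `to_signs`) and vectors are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vectors, doc_vectors):
    """Sum over query vectors of the best dot product with any document vector."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


def to_signs(vectors):
    """Crude 1-bit quantization: keep only the sign of each component."""
    return [[1.0 if x >= 0 else -1.0 for x in v] for v in vectors]


def two_stage_search(query, docs, candidates, limit):
    """Shortlist by approximate score, then rerank the shortlist exactly."""
    # Stage 1: cheap approximate MaxSim on sign-quantized vectors.
    q_signs = to_signs(query)
    by_approx = sorted(
        docs,
        key=lambda doc_id: maxsim(q_signs, to_signs(docs[doc_id])),
        reverse=True,
    )
    shortlist = by_approx[:candidates]
    # Stage 2: exact MaxSim, computed only for the shortlisted documents.
    by_exact = sorted(
        shortlist,
        key=lambda doc_id: maxsim(query, docs[doc_id]),
        reverse=True,
    )
    return by_exact[:limit]


# Hypothetical data: "b" is the true best match, but quantization underrates it.
query = [[0.9, 0.1]]
docs = {
    "a": [[0.1, 0.1]],
    "b": [[1.0, -0.05]],
    "c": [[-1.0, -1.0]],
}

print(two_stage_search(query, docs, candidates=1, limit=1))  # too few candidates
print(two_stage_search(query, docs, candidates=2, limit=1))  # "b" survives pruning
```

With candidates=1 the approximate stage discards "b" before exact scoring can recover it; raising candidates to 2 lets the exact pass see it, at the cost of scoring one more document.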